In this project, sales of several products on the e-commerce platform 'Trendyol' will be forecast. The daily sold count of each product will be examined and the series decomposed; several forecasting strategies will then be developed, and the best of them will be picked according to weighted mean absolute percentage error. Data before 29 May 2021 forms the training set on which the models learn, and data from 29 May to 11 June 2021 forms the test set. Nine products are examined:
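The split above can be sketched as follows. Column names such as 'event_date' and 'sold_count' are assumptions about the project's data; the toy series below is only for illustration.

```r
# Split a daily data frame into a training part (before the cutoff)
# and a 14-day test window, matching the 29 May - 11 June 2021 split.
split_train_test <- function(df, cutoff = as.Date("2021-05-29"),
                             test_end = as.Date("2021-06-11")) {
  list(train = df[df$event_date < cutoff, ],
       test  = df[df$event_date >= cutoff & df$event_date <= test_end, ])
}

# Toy daily series covering 1 January - 11 June 2021 (162 days)
sold <- data.frame(
  event_date = seq(as.Date("2021-01-01"), as.Date("2021-06-11"), by = "day"),
  sold_count = rpois(162, lambda = 50)
)
parts <- split_train_test(sold)
nrow(parts$test)   # 14 test days: 29 May .. 11 June
```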
Since campaign dates matter for sales, and most sales peaks happen during campaigns, Trendyol's campaign dates were collected from its website and included as an external input attribute, 'is_campaign'.
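Constructing the flag is straightforward; the campaign dates below are illustrative placeholders, not the actual Trendyol campaign calendar.

```r
# Hypothetical campaign dates collected from the website
campaign_dates <- as.Date(c("2021-05-26", "2021-05-27", "2021-05-28"))

# Flag each day of the series as campaign (1) or not (0)
dates <- seq(as.Date("2021-05-24"), as.Date("2021-05-30"), by = "day")
is_campaign <- as.integer(dates %in% campaign_dates)
is_campaign   # 0 0 1 1 1 0 0
```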
Before building forecasting models, the data should be plotted and examined for trend and seasonality. Below you can see the plot of the sales quantity of Product 1. There is a slightly increasing trend, especially in the middle of the series, but no significant seasonality is visible. For a closer look, a plot of three months of 2021 (March, April and May) follows. The seasonality is still not strong, but sales are higher at the beginning of each month and decrease toward its end, so a monthly seasonality can be argued.
The first type of model is linear regression. First of all, it is wise to select helpful attributes from the correlation matrix. Below you can see the correlations between the attributes. According to this matrix, category_sold, category_favored and basket_count can be added to the model.
In the first model, these attributes are included. The adjusted R-squared value indicates how well the model fits; for the first model it is already quite high, which is a good sign. However, there are outliers, probably due to campaigns and holidays, which can be flagged for a better model. Lastly, a 'lag1' attribute is added because lag 1 is very high in the residual ACF. In the final linear regression model, the adjusted R-squared is high enough and the diagnostic plots look good enough to make predictions.
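The three modelling steps can be sketched on simulated data. The real report works on the 'sold' table with these column names; the generating coefficients below are made up for illustration.

```r
set.seed(1)
n <- 200
toy <- data.frame(category_sold = round(rnorm(n, 500, 100)),
                  basket_count  = round(rnorm(n, 300, 80)))
toy$sold_count <- 5 + 0.12 * toy$category_sold +
                  0.14 * toy$basket_count + rnorm(n, sd = 5)

# Step 1: regression on the attributes chosen from the correlation matrix
fit1 <- lm(sold_count ~ category_sold + basket_count, data = toy)

# Step 2: flag large residuals (campaign/holiday spikes) with a dummy
toy$big_outlier <- as.integer(abs(residuals(fit1)) > 2 * sd(residuals(fit1)))
fit2 <- lm(sold_count ~ big_outlier + category_sold + basket_count, data = toy)

# Step 3: add lag 1 of the target, which dominates the residual ACF
toy$lag1 <- c(NA, head(toy$sold_count, -1))
fit3 <- lm(sold_count ~ lag1 + big_outlier + category_sold + basket_count,
           data = toy)
summary(fit3)$adj.r.squared
```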
##
## Call:
## lm(formula = sold_count ~ category_sold + category_favored +
## basket_count, data = sold)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86.278 -11.238 -0.387 8.763 168.980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7442865 2.8040394 1.692 0.0915 .
## category_sold 0.1187613 0.0062677 18.948 < 2e-16 ***
## category_favored -0.0015302 0.0002083 -7.347 1.34e-12 ***
## basket_count 0.1407651 0.0090971 15.474 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.46 on 365 degrees of freedom
## Multiple R-squared: 0.8403, Adjusted R-squared: 0.839
## F-statistic: 640.4 on 3 and 365 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 140.82, df = 10, p-value < 2.2e-16
## sold_count
## Min. : 14.00
## 1st Qu.: 33.00
## Median : 56.00
## Mean : 74.17
## 3rd Qu.: 89.00
## Max. :447.00
##
## Call:
## lm(formula = sold_count ~ big_outlier + category_sold + category_favored +
## basket_count, data = sold)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.651 -8.335 -1.034 8.277 121.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.5878617 2.3643596 4.901 1.44e-06 ***
## big_outlier 76.5329182 5.7826657 13.235 < 2e-16 ***
## category_sold 0.0867377 0.0056964 15.227 < 2e-16 ***
## category_favored -0.0008900 0.0001781 -4.998 9.01e-07 ***
## basket_count 0.1075103 0.0078954 13.617 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.95 on 364 degrees of freedom
## Multiple R-squared: 0.8922, Adjusted R-squared: 0.891
## F-statistic: 753.2 on 4 and 364 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 112.47, df = 10, p-value < 2.2e-16
##
## Call:
## lm(formula = sold_count ~ lag1 + big_outlier + category_sold +
## category_favored + basket_count, data = sold)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.630 -7.746 -0.706 7.253 123.997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.9269325 2.0130544 4.931 1.25e-06 ***
## lag1 0.5443102 0.0457488 11.898 < 2e-16 ***
## big_outlier 63.1752763 5.0382831 12.539 < 2e-16 ***
## category_sold 0.0940932 0.0048777 19.290 < 2e-16 ***
## category_favored -0.0009748 0.0001514 -6.438 3.84e-10 ***
## basket_count 0.1106151 0.0067112 16.482 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.79 on 363 degrees of freedom
## Multiple R-squared: 0.9225, Adjusted R-squared: 0.9214
## F-statistic: 863.6 on 5 and 363 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 18.357, df = 10, p-value = 0.04924
The second type of model is ARIMA. For this model the data must first be decomposed, which requires choosing a frequency. Since there is no significant seasonality, the lag with the highest ACF value, 63, is chosen. Additive decomposition is used for this task. The random series is shown below.
After the decomposition, the (p, d, q) orders should be chosen for the model, guided by the ACF and PACF. Looking at the ACF, q = 1 or q = 7 can be chosen; looking at the PACF, p = 1. The auto.arima function is used as well. The AIC and BIC values of the candidate models can be seen below; smaller AIC and BIC mean a better model. By this criterion, the (2,0,2) model suggested by auto.arima is the best among them. After the orders are selected, the regressors most correlated with the sold count are added to improve the model. The final model has lower AIC and BIC, so we proceed with it.
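The workflow can be sketched with base R on a simulated series. The 63-day frequency and the candidate orders follow the text; auto.arima (from the 'forecast' package) is only referenced in a comment here, and the xreg names are placeholders.

```r
set.seed(7)
# Simulated daily series with a weak 63-day cycle plus AR(1) noise
y <- ts(50 + 10 * sin(2 * pi * (1:369) / 63) +
          10 * arima.sim(list(ar = 0.6), 369),
        frequency = 63)

# Additive decomposition; the remainder plays the role of 'detrend'
dec <- decompose(y, type = "additive")
detrend <- ts(as.numeric(na.omit(dec$random)))

# Candidate orders read off the ACF/PACF, compared by AIC
fit_101 <- arima(detrend, order = c(1, 0, 1))
fit_202 <- arima(detrend, order = c(2, 0, 2))
c(AIC(fit_101), AIC(fit_202))   # pick the smaller

# External regressors are then supplied through 'xreg', e.g.:
# arima(detrend, order = c(2, 0, 2), xreg = cbind(category_sold, basket_count))
```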
##
## Call:
## arima(x = detrend, order = c(1, 0, 1))
##
## Coefficients:
## ar1 ma1 intercept
## 0.6650 0.0123 -1.5566
## s.e. 0.0574 0.0702 6.0436
##
## sigma^2 estimated as 1244: log likelihood = -1529.77, aic = 3067.54
## [1] 3067.536
## [1] 3082.443
##
## Call:
## arima(x = detrend, order = c(1, 0, 7))
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 ma5 ma6 ma7
## 0.8658 -0.2496 -0.0680 -0.1138 -0.2193 -0.1632 -0.0457 -0.1405
## s.e. 0.0427 0.0696 0.0622 0.0643 0.0589 0.0551 0.0697 0.0702
## intercept
## -0.4768
## s.e. 0.5468
##
## sigma^2 estimated as 1129: log likelihood = -1516.43, aic = 3052.87
## [1] 3052.868
## [1] 3090.136
## Series: detrend
## ARIMA(2,0,2) with zero mean
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 1.5221 -0.6871 -0.8673 0.1966
## s.e. 0.1703 0.0984 0.1811 0.0930
##
## sigma^2 estimated as 1201: log likelihood=-1522.43
## AIC=3054.86 AICc=3055.06 BIC=3073.5
## [1] 3054.864
## [1] 3073.498
##
## Call:
## arima(x = detrend, order = c(2, 0, 2), xreg = xreg)
##
## Coefficients:
## ar1 ar2 ma1 ma2 intercept xreg1 xreg2
## 0.8477 -0.1219 -0.1993 0.1917 -52.5534 0.1673 -2e-04
## s.e. 0.2838 0.2328 0.2780 0.0934 7.9501 0.0180 3e-04
##
## sigma^2 estimated as 780.6: log likelihood = -1458.35, aic = 2932.71
## [1] 2932.707
## [1] 2962.521
Two models were selected for prediction; their accuracy values can be seen here. According to the box plot, the weighted absolute errors of the linear model have higher variance, especially toward the end. The ARIMA model should be chosen because its WMAPE is lower, a sign of a better model.
## variable n mean sd CV FBias MAPE RMSE
## 1: lm_prediction 14 83.35714 17.09074 0.2050303 -0.72352232 0.8010225 109.8228
## 2: selected_arima 14 83.35714 17.09074 0.2050303 -0.03885441 0.3287008 35.2479
## MAD MADP WMAPE
## 1: 63.38325 0.7603817 0.7603817
## 2: 26.33523 0.3159325 0.3159325
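The WMAPE reported in the tables weights each day's absolute error by actual sales, i.e. sum(|actual - predicted|) / sum(actual); a minimal implementation with made-up numbers:

```r
wmape <- function(actual, predicted) {
  sum(abs(actual - predicted)) / sum(actual)
}

# Tiny illustrative example
actual    <- c(80, 95, 70, 110)
predicted <- c(75, 100, 72, 100)
wmape(actual, predicted)   # 22 / 355
```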
To conclude, here is a plot of the actual test set against the predictions of the chosen model. As can be seen, the predictions are quite accurate.
Before building forecasting models for Product 2, the data should be plotted and examined for trend and seasonality. Below you can see the plot of the sales quantity of Product 2. There is no significant trend and no significant seasonality. For a closer look, a plot of three months of 2021 (March, April and May) follows. Again, the seasonality is weak, though a spike can be seen at the beginning of each month. In May there is a large rise, probably due to Covid-19 conditions. In conclusion, a monthly seasonality can be argued, but it is not very clear.
The first type of model is linear regression. First of all, it is wise to select helpful attributes from the correlation matrix. Below you can see the correlations between the attributes. According to this matrix, category_sold, category_visits and basket_count can be added to the model.
In the first model, these attributes are included. The adjusted R-squared value indicates how well the model fits; for the first model it is already quite high, which is a good sign. However, there are outliers, probably due to campaigns and holidays, which can be flagged for a better model. Lastly, a 'lag1' attribute is added because lag 1 is very high in the residual ACF. In the final linear regression model, the adjusted R-squared is high enough and the diagnostic plots look good enough to make predictions.
##
## Call:
## lm(formula = sold_count ~ category_sold + category_visits + basket_count,
## data = sold)
##
## Residuals:
## Min 1Q Median 3Q Max
## -422.40 -60.15 1.95 63.20 1208.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60.17090 11.42700 -5.266 2.39e-07 ***
## category_sold 0.14185 0.02200 6.449 3.58e-10 ***
## category_visits 0.00693 0.01256 0.552 0.581
## basket_count 0.18780 0.01162 16.161 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 128.9 on 365 degrees of freedom
## Multiple R-squared: 0.9068, Adjusted R-squared: 0.906
## F-statistic: 1183 on 3 and 365 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 125.12, df = 10, p-value < 2.2e-16
## sold_count
## Min. : 30.0
## 1st Qu.: 165.0
## Median : 238.0
## Mean : 381.4
## 3rd Qu.: 431.0
## Max. :4191.0
##
## Call:
## lm(formula = sold_count ~ big_outlier + category_sold + category_visits +
## basket_count, data = sold)
##
## Residuals:
## Min 1Q Median 3Q Max
## -356.35 -52.28 10.07 53.54 1315.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.148e+01 1.241e+01 -1.730 0.0845 .
## big_outlier 2.303e+02 3.592e+01 6.410 4.51e-10 ***
## category_sold 1.425e-01 2.088e-02 6.824 3.71e-11 ***
## category_visits -4.873e-04 1.198e-02 -0.041 0.9676
## basket_count 1.477e-01 1.268e-02 11.655 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 122.4 on 364 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9153
## F-statistic: 995.2 on 4 and 364 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 95.607, df = 10, p-value = 4.11e-16
##
## Call:
## lm(formula = sold_count ~ lag1 + big_outlier + category_sold +
## category_visits + basket_count, data = sold)
##
## Residuals:
## Min 1Q Median 3Q Max
## -381.58 -37.12 4.89 39.84 1334.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40.28635 11.39508 -3.535 0.00046 ***
## lag1 0.44599 0.04880 9.140 < 2e-16 ***
## big_outlier 178.62606 32.91952 5.426 1.05e-07 ***
## category_sold 0.13014 0.01890 6.886 2.54e-11 ***
## category_visits 0.01271 0.01091 1.165 0.24494
## basket_count 0.15168 0.01145 13.244 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 110.5 on 363 degrees of freedom
## Multiple R-squared: 0.9319, Adjusted R-squared: 0.931
## F-statistic: 993.4 on 5 and 363 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 74.502, df = 10, p-value = 5.947e-12
The second type of model is ARIMA. For this model the data must first be decomposed, which requires choosing a frequency. Since there is no significant seasonality, the lag with the highest ACF value, 34, is chosen. Additive decomposition is used for this task. The random series is shown below.
After the decomposition, the (p, d, q) orders should be chosen for the model, guided by the ACF and PACF. Looking at the ACF, q = 1 or q = 11 can be chosen; looking at the PACF, p = 1. The auto.arima function is used as well. The AIC and BIC values of the candidate models can be seen below. By these values, the (1,0,11) model is the best among them. After the orders are selected, the regressors most correlated with the sold count are added to improve the model. The final model has lower AIC and BIC, so we proceed with it.
##
## Call:
## arima(x = detrend, order = c(1, 0, 1))
##
## Coefficients:
## ar1 ma1 intercept
## 0.5985 0.1204 -2.2120
## s.e. 0.0598 0.0686 45.0812
##
## sigma^2 estimated as 88277: log likelihood = -2383.17, aic = 4774.34
## [1] 4774.343
## [1] 4789.6
##
## Call:
## arima(x = detrend, order = c(1, 0, 11))
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 ma5 ma6 ma7
## 0.5115 0.0898 0.0048 -0.1392 -0.1806 -0.2103 -0.1589 -0.1076
## s.e. 0.2066 0.2088 0.1286 0.0770 0.0556 0.0745 0.0945 0.0925
## ma8 ma9 ma10 ma11 intercept
## -0.0942 -0.0735 -0.0572 -0.0731 0.3060
## s.e. 0.0784 0.0771 0.0727 0.0640 2.0291
##
## sigma^2 estimated as 76841: log likelihood = -2361.76, aic = 4751.51
## [1] 4751.515
## [1] 4804.913
## Series: detrend
## ARIMA(3,0,0) with zero mean
##
## Coefficients:
## ar1 ar2 ar3
## 0.7228 -0.0081 -0.1412
## s.e. 0.0540 0.0669 0.0539
##
## sigma^2 estimated as 86941: log likelihood=-2379.15
## AIC=4766.29 AICc=4766.41 BIC=4781.55
## [1] 4766.292
## [1] 4781.549
##
## Call:
## arima(x = detrend, order = c(1, 0, 11), xreg = xreg)
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 ma5 ma6 ma7 ma8
## 0.5558 0.1483 0.178 0.1079 0.0327 8e-04 0.0653 0.0634 0.0101
## s.e. NaN NaN NaN NaN NaN NaN NaN NaN NaN
## ma9 ma10 ma11 intercept xreg1 xreg2 xreg3
## 0.0076 0.0436 0.0388 -450.0970 0.1404 0.0732 0.0487
## s.e. NaN 0.0533 0.0598 33.0371 0.0164 0.0184 0.0316
##
## sigma^2 estimated as 19786: log likelihood = -2132.8, aic = 4299.6
## [1] 4299.597
## [1] 4364.438
Two models were selected for prediction; their accuracy values can be seen here. According to the box plot, the weighted absolute errors of the ARIMA model are higher. The linear model should be chosen because its WMAPE is lower, a sign of a better model.
## variable n mean sd CV FBias MAPE RMSE
## 1: lm_prediction 14 542.4286 335.978 0.6193958 -0.1358889 0.2050354 263.4115
## 2: selected_arima 14 542.4286 335.978 0.6193958 0.8441860 0.8331456 649.5670
## MAD MADP WMAPE
## 1: 115.9278 0.2137200 0.2137200
## 2: 512.1721 0.9442203 0.9442203
To conclude, here is a plot of the actual test set against the predictions of the chosen model. As can be seen, the predictions are quite accurate.
Looking at the plots of the product below: in the line graph, the sales show noticeable variance, with peaks on some dates, and a possible cyclical behaviour that hints at seasonality. For further investigation, the '3 Months Sales of 2021' plot can be examined; no clear repeating pattern is easily observed there.
Looking at the boxplots: in the weekly boxplot, sales on the weekdays seem similar, so daily and weekly seasonality can be investigated. In the monthly boxplot, sales change with the months, but there is no clear repeating monthly behaviour. In the histograms, one can observe that the sales distribution is close to a normal distribution.
First, different ARIMA models are built so they can be tested on the test set. Before building an ARIMA model, the data should be decomposed, which requires choosing a frequency. Frequencies of 30 and 7 days are selected and the data is decomposed accordingly. In addition, the lag with high autocorrelation in the ACF plot is chosen as another trial frequency. Since the variance does not seem to be increasing, additive decomposition is used. The random series can be seen below.
Decomposition with 7 Day Freq
The decomposition series above belong to the time series with 7- and 30-day frequency, respectively.
Looking at the ACF plot of the series, the highest ACF value belongs to lag 32, so a decomposition with 32-day frequency is also tried.
In time series decomposition, the random part is assumed to behave like a purely random series with mean zero and constant variance; to decide on the best frequency, the random part of each decomposition should be inspected. In this case, the random part of the 7-day decomposition looks closest to such a series, so it is chosen as the final decomposition.
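This comparison can be sketched by decomposing a simulated series at each candidate frequency and summarising the random component; for a series with a genuine 7-day cycle, the 7-day decomposition should leave the smallest remainder spread. The data here is simulated for illustration.

```r
set.seed(3)
# Simulated series with a true 7-day cycle plus noise
x <- 100 + 15 * sin(2 * pi * (1:365) / 7) + rnorm(365, sd = 5)

# Mean and spread of the random component for a given frequency
random_stats <- function(x, freq) {
  r <- na.omit(decompose(ts(x, frequency = freq), type = "additive")$random)
  c(freq = freq, mean = mean(r), sd = sd(r))
}
rbind(random_stats(x, 7), random_stats(x, 30))
```

The 7-day decomposition absorbs the cycle into the seasonal component, so its random part has a much smaller standard deviation than the 30-day one.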
After the decomposition, the (p, d, q) orders should be chosen for the model, guided by the ACF and PACF: peaks in the ACF suggest candidate q values and peaks in the PACF suggest candidate p values. Looking at the ACF, q = 3 or q = 4 may be selected; looking at the PACF, p = 3 or p = 9. The auto.arima function is used as well. The AIC and BIC values of the candidate models can be seen below; smaller AIC and BIC mean a better model. By this criterion, the (3,0,4) model, chosen from the ACF and PACF plots, is the best among them.
##
## Call:
## arima(x = detrend, order = c(3, 0, 3))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 intercept
## 0.3596 0.1296 -0.3567 -0.5363 -0.4437 -0.0199 -0.0171
## s.e. 0.1564 0.2232 0.1428 0.1644 0.2472 0.2030 0.0711
##
## sigma^2 estimated as 9101: log likelihood = -2381.81, aic = 4779.61
## [1] 4779.61
## [1] 4811.502
##
## Call:
## arima(x = detrend, order = c(3, 0, 4))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 ma4 intercept
## 0.8120 0.1637 -0.1956 -1.0467 -0.4247 0.0305 0.4411 -0.0200
## s.e. 0.4263 0.7867 0.4415 0.3881 0.9524 0.7568 0.1966 0.0114
##
## sigma^2 estimated as 8602: log likelihood = -2373.03, aic = 4764.07
## [1] 4764.067
## [1] 4799.945
##
## Call:
## arima(x = detrend, order = c(9, 0, 4))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ar8 ar9
## 0.5253 0.3001 0.1109 -0.4532 0.2384 0.0173 -0.0734 0.1143 -0.0900
## s.e. 0.1112 0.1566 0.1444 0.1133 0.0740 0.0734 0.0676 0.0623 0.0563
## ma1 ma2 ma3 ma4 intercept
## -0.7536 -0.6178 -0.4313 0.8027 -0.0193
## s.e. 0.1060 0.1732 0.1505 0.0946 0.0126
##
## sigma^2 estimated as 8307: log likelihood = -2366.08, aic = 4762.16
## [1] 4762.156
## [1] 4821.952
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : 4804.437
## ARIMA(0,0,0) with non-zero mean : 4936.799
## ARIMA(1,0,0) with non-zero mean : 4930.902
## ARIMA(0,0,1) with non-zero mean : 4928.255
## ARIMA(0,0,0) with zero mean : 4934.779
## ARIMA(1,0,2) with non-zero mean : Inf
## ARIMA(2,0,1) with non-zero mean : 4804.68
## ARIMA(3,0,2) with non-zero mean : Inf
## ARIMA(2,0,3) with non-zero mean : Inf
## ARIMA(1,0,1) with non-zero mean : 4930.494
## ARIMA(1,0,3) with non-zero mean : Inf
## ARIMA(3,0,1) with non-zero mean : Inf
## ARIMA(3,0,3) with non-zero mean : Inf
## ARIMA(2,0,2) with zero mean : 4802.813
## ARIMA(1,0,2) with zero mean : 4829.38
## ARIMA(2,0,1) with zero mean : 4803.171
## ARIMA(3,0,2) with zero mean : Inf
## ARIMA(2,0,3) with zero mean : Inf
## ARIMA(1,0,1) with zero mean : 4928.454
## ARIMA(1,0,3) with zero mean : Inf
## ARIMA(3,0,1) with zero mean : Inf
## ARIMA(3,0,3) with zero mean : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,0,2) with zero mean : Inf
## ARIMA(2,0,1) with zero mean : Inf
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(2,0,1) with non-zero mean : Inf
## ARIMA(1,0,2) with zero mean : Inf
## ARIMA(0,0,1) with non-zero mean : 4928.265
##
## Best model: ARIMA(0,0,1) with non-zero mean
## Series: detrend
## ARIMA(0,0,1) with non-zero mean
##
## Coefficients:
## ma1 mean
## 0.1699 -0.140
## s.e. 0.0484 6.876
##
## sigma^2 estimated as 13828: log likelihood=-2461.1
## AIC=4928.2 AICc=4928.26 BIC=4940.16
## [1] 4928.204
## [1] 4940.163
The second type of model is linear regression. Below you can see the correlations between the attributes. According to this matrix, basket_count, price_count, visit_count and favored_count can be added to the model. Since the box plots above showed monthly change in the data, month information can also be added to the candidate models.
The performance of the different linear regression and ARIMA models on the test dates is calculated, and the best model is selected according to that performance.
## variable n mean sd CV FBias MAPE
## 1: lm_prediction2 14 451.5714 90.71063 0.2008777 -0.02509715 0.09312132
## 2: lm_prediction3 14 451.5714 90.71063 0.2008777 -0.07632216 0.11880289
## 3: lm_prediction4 14 451.5714 90.71063 0.2008777 -0.08353170 0.11647223
## 4: lm_prediction5 14 451.5714 90.71063 0.2008777 -0.11399446 0.12828656
## 5: lm_prediction6 14 451.5714 90.71063 0.2008777 -0.03476233 0.07662185
## 6: lm_prediction7 14 451.5714 90.71063 0.2008777 -0.10582440 0.12395939
## 7: arima_prediction 14 451.5714 90.71063 0.2008777 0.05141121 0.12779687
## 8: sarima_prediction 14 451.5714 90.71063 0.2008777 0.05256333 0.12798436
## 9: selected_arima 14 451.5714 90.71063 0.2008777 0.09418716 0.17941751
## RMSE MAD MADP WMAPE
## 1: 49.31985 40.23665 0.08910363 0.08910363
## 2: 58.61266 50.09150 0.11092707 0.11092707
## 3: 59.53828 49.53706 0.10969928 0.10969928
## 4: 64.90818 55.99223 0.12399418 0.12399418
## 5: 42.32081 32.44684 0.07185318 0.07185318
## 6: 60.99548 52.93493 0.11722384 0.11722384
## 7: 77.45611 61.04713 0.13518821 0.13518821
## 8: 77.46723 61.18399 0.13549128 0.13549128
## 9: 100.82860 81.07444 0.17953847 0.17953847
The smallest weighted mean absolute percentage error is obtained for the linear regression model 'sold_count ~ basket_count + visit_count + as.factor(mon) + as.factor(is_campaign)', so from here on this model is selected for prediction.
To conclude, here is a plot of the actual test set against the predictions of the chosen model. As can be seen, the predictions are quite accurate.
## One Day Ahead Prediction with the Selected Model for Product 3
With the selected model, a one-day-ahead prediction can be made using all the data on hand, since a one-day-ahead prediction must be submitted in this competition.
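Using the formula selected above, the one-day-ahead step looks like this: refit on all history, then predict a single new row holding the next day's regressors. The data and generating coefficients below are simulated stand-ins for the real table.

```r
set.seed(5)
n <- 100
history <- data.frame(
  basket_count = rpois(n, 900),
  visit_count  = rpois(n, 11000),
  mon          = rep(1:6, length.out = n),     # ensure all month levels occur
  is_campaign  = rep(c(0, 1), c(90, 10))       # ensure both levels occur
)
history$sold_count <- 0.2 * history$basket_count +
                      0.01 * history$visit_count + rnorm(n, sd = 20)

# The formula the report selects for Product 3
fit <- lm(sold_count ~ basket_count + visit_count + as.factor(mon) +
            as.factor(is_campaign), data = history)

# Regressor values for the next day (made up), then a single prediction
next_day <- data.frame(basket_count = 1001, visit_count = 11850,
                       mon = 6, is_campaign = 0)
predict(fit, newdata = next_day)
```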
## price event_date product_content_id sold_count visit_count favored_count
## 1: 114.15 2021-07-02 6676673 307 11850 672
## basket_count category_sold category_brand_sold category_visits ty_visits
## 1: 1001 4255 778 224985 99819109
## category_basket category_favored w_day mon is_campaign
## 1: 18828 17424 6 7 0
## price event_date product_content_id sold_count visit_count favored_count
## 1: 114.15 2021-07-04 6676673 307 11850 672
## basket_count category_sold category_brand_sold category_visits ty_visits
## 1: 1001 4255 778 224985 99819109
## category_basket category_favored w_day mon is_campaign lm_prediction
## 1: 18828 17424 6 7 0 380.0093
Looking at the plots of the product below: in the line graph, the sales show noticeable variance, with high outliers on some dates, and a possible cyclical behaviour that hints at seasonality. For further investigation, the '3 Months Sales of 2021' plot can be examined; no clear repeating pattern is easily observed there.
Looking at the boxplots: in the weekly boxplot, sales on the weekdays seem similar, so daily and weekly seasonality can be investigated. In the monthly boxplot, sales change with the months, but there is no clear repeating monthly behaviour. In the histograms, one can observe that the sales distribution is close to a normal distribution.
First, different ARIMA models are built so they can be tested on the test set. Frequencies of 30 and 7 days are selected and the data is decomposed accordingly. Since the variance does not seem to be increasing, additive decomposition is used. The random series can be seen below.
The decomposition series above belong to the time series with 7- and 30-day frequency, respectively. Looking at the ACF plot of the series, the highest ACF value belongs to lag 16, so a decomposition with 16-day frequency is also tried.
In this case, the random part of the 16-day decomposition looks closest to a purely random series with mean zero and constant variance, so it is chosen as the final decomposition.
Looking at the ACF, q = 5 or q = 7 may be selected; looking at the PACF, p = 1 or p = 3. The auto.arima function is used as well. The AIC and BIC values of the candidate models can be seen below. The ARIMA(3,0,5) model, chosen by inspecting the ACF and PACF plots, has a smaller AIC than the ARIMA(1,0,2) model suggested by auto.arima, so ARIMA(3,0,5) will be used for performance comparison with the linear models.
##
## Call:
## arima(x = detrend, order = c(3, 0, 7))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 ma4 ma5
## 0.7349 0.7381 -0.5911 -0.4786 -0.9036 0.0535 -0.0504 0.1568
## s.e. NaN NaN NaN NaN NaN 0.0764 0.0763 0.0787
## ma6 ma7 intercept
## 0.1090 0.1132 -0.1069
## s.e. 0.0433 0.0595 0.0620
##
## sigma^2 estimated as 13934: log likelihood = -2406.41, aic = 4836.82
## [1] 4836.816
## [1] 4884.348
##
## Call:
## arima(x = detrend, order = c(3, 0, 5))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 ma4 ma5
## 0.7781 0.8652 -0.7584 -0.5273 -1.063 0.1628 0.1188 0.3088
## s.e. NaN NaN NaN NaN NaN 0.0807 0.0586 0.0509
## intercept
## -0.0955
## s.e. 0.0798
##
## sigma^2 estimated as 14197: log likelihood = -2409.5, aic = 4839
## [1] 4839.004
## [1] 4878.614
##
## Call:
## arima(x = detrend, order = c(1, 0, 5))
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 ma5 intercept
## 0.5853 -0.2723 -0.0939 -0.2944 -0.1989 -0.1404 -0.0759
## s.e. 0.0724 0.0781 0.0544 0.0586 0.0594 0.0597 0.3711
##
## sigma^2 estimated as 14991: log likelihood = -2418.13, aic = 4852.26
## [1] 4852.26
## [1] 4883.949
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : 4865.152
## ARIMA(0,0,0) with non-zero mean : 5017.066
## ARIMA(1,0,0) with non-zero mean : 4917.217
## ARIMA(0,0,1) with non-zero mean : 4938.712
## ARIMA(0,0,0) with zero mean : 5015.046
## ARIMA(1,0,2) with non-zero mean : 4907.553
## ARIMA(2,0,1) with non-zero mean : 4920.622
## ARIMA(3,0,2) with non-zero mean : Inf
## ARIMA(2,0,3) with non-zero mean : 4857.729
## ARIMA(1,0,3) with non-zero mean : 4908.329
## ARIMA(3,0,3) with non-zero mean : Inf
## ARIMA(2,0,4) with non-zero mean : 4859.033
## ARIMA(1,0,4) with non-zero mean : Inf
## ARIMA(3,0,4) with non-zero mean : Inf
## ARIMA(2,0,3) with zero mean : 4856.042
## ARIMA(1,0,3) with zero mean : 4906.267
## ARIMA(2,0,2) with zero mean : 4863.372
## ARIMA(3,0,3) with zero mean : Inf
## ARIMA(2,0,4) with zero mean : 4857.376
## ARIMA(1,0,2) with zero mean : 4905.5
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(3,0,2) with zero mean : Inf
## ARIMA(3,0,4) with zero mean : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,0,3) with zero mean : Inf
## ARIMA(2,0,4) with zero mean : Inf
## ARIMA(2,0,3) with non-zero mean : Inf
## ARIMA(2,0,4) with non-zero mean : Inf
## ARIMA(2,0,2) with zero mean : Inf
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(1,0,2) with zero mean : 4904.915
##
## Best model: ARIMA(1,0,2) with zero mean
## Series: detrend
## ARIMA(1,0,2) with zero mean
##
## Coefficients:
## ar1 ma1 ma2
## 0.1387 0.3543 0.2752
## s.e. 0.1436 0.1378 0.0693
##
## sigma^2 estimated as 17847: log likelihood=-2448.41
## AIC=4904.81 AICc=4904.91 BIC=4920.65
## [1] 4904.81
## [1] 4920.654
Below you can see the correlations between the attributes. According to this matrix, basket_count, category_favored, is_campaign and category_sold can be added to the model in different combinations. Since the box plots above showed monthly change in the data, month information can also be added to the candidate models.
The performance of the different linear regression and ARIMA models on the test dates is calculated, and the best model is selected according to that performance.
## variable n mean sd CV FBias MAPE RMSE
## 1: lm_prediction1 14 21 7.200427 0.3428775 -0.20693883 0.2697431 5.966694
## 2: lm_prediction2 14 21 7.200427 0.3428775 -2.97927177 3.4236791 75.719475
## 3: lm_prediction3 14 21 7.200427 0.3428775 -3.28869474 3.8131788 83.123744
## 4: lm_prediction4 14 21 7.200427 0.3428775 -3.05884773 3.5175993 76.628455
## 5: lm_prediction5 14 21 7.200427 0.3428775 -0.35648820 0.3872554 13.486489
## 6: lm_prediction6 14 21 7.200427 0.3428775 -2.81391925 3.2181353 71.004414
## 7: arima_prediction 14 21 7.200427 0.3428775 -0.09014912 0.2865406 7.276734
## 8: sarima_prediction 14 21 7.200427 0.3428775 0.02528538 0.2798477 7.197155
## 9: selected_arima 14 21 7.200427 0.3428775 0.10692728 0.3737146 9.239168
## MAD MADP WMAPE
## 1: 5.193758 0.2473218 0.2473218
## 2: 62.564707 2.9792718 2.9792718
## 3: 69.062590 3.2886947 3.2886947
## 4: 64.235802 3.0588477 3.0588477
## 5: 8.640406 0.4114479 0.4114479
## 6: 59.092304 2.8139193 2.8139193
## 7: 5.722414 0.2724959 0.2724959
## 8: 5.455298 0.2597761 0.2597761
## 9: 7.505308 0.3573956 0.3573956
The smallest weighted mean absolute percentage error is obtained for the linear regression model 'sold_count ~ basket_count + as.factor(mon)'. However, since it has only two input attributes, a slight change in either one has an outsized effect on the prediction, so the model with the second-smallest WMAPE is chosen instead: ARIMA(1,1,4) on the series decomposed with 16-day frequency, which is also the model auto.arima suggested. From here on, this model is selected for prediction.
To conclude, here is a plot of the actual test set against the predictions of the chosen model. As can be seen, the predictions are not too far off.
With the selected model, a one-day-ahead prediction can be made using all the data on hand, since a one-day-ahead prediction must be submitted in this competition. Before fitting, stationarity of the series is checked with a KPSS unit root test; the test statistic (0.0068) is well below even the 10% critical value (0.347), so the null hypothesis of stationarity is not rejected.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0068
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
##
## Call:
## arima(x = detrend1, order = c(1, 1, 4), xreg = data_7061886$is_campaign)
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 data_7061886$is_campaign
## -0.0682 -0.3683 -0.1179 -0.1860 -0.3279 21.6048
## s.e. 0.1652 0.1548 0.1040 0.0564 0.0647 5.4068
##
## sigma^2 estimated as 452.3: log likelihood = -1734.73, aic = 3483.45
## [1] 3483.451
## [1] 3511.16
## price event_date product_content_id sold_count visit_count favored_count
## 1: 297.08 2021-07-04 7061886 18 1249 131
## basket_count category_sold category_brand_sold category_visits ty_visits
## 1: 70 737 163 53346 99819109
## category_basket category_favored w_day mon is_campaign arima1_prediction
## 1: 2800 4702 6 7 0 4.469781
Looking at the plots of this product below: the line graph shows that sales have increasing variance, some dates stand out as high outliers, and there may be cyclical behaviour, which would indicate seasonality. For further investigation, the ‘3 Months Sales of 2021’ plot can be examined; no clear repeating pattern is easily observed there.
Looking at the boxplots: in the weekly boxplot, sales across weekdays seem similar, so daily and weekly seasonality can be investigated further. In the monthly boxplot there is variation across months, but the monthly medians are close to each other, which may be an indicator of monthly seasonality. The histogram shows that the sales distribution is close to normal.
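The exploratory plots described above can be sketched with base R graphics. This assumes a data table with the `sold_count`, `w_day`, and `mon` columns shown in the output tables of this report.

```r
# Weekly and monthly boxplots, plus a histogram of sales
# (column names assumed from the data tables in this report).
boxplot(sold_count ~ w_day, data = data,
        main = "Sales by Day of Week", xlab = "Day", ylab = "Sold Count")
boxplot(sold_count ~ mon, data = data,
        main = "Sales by Month", xlab = "Month", ylab = "Sold Count")
hist(data$sold_count, main = "Distribution of Sales", xlab = "Sold Count")
```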
First, different ARIMA models can be built and evaluated on the test set. Frequencies of 30 and 7 days can be selected and the data decomposed accordingly. Since the variance seems to be increasing, a multiplicative decomposition can be used. The resulting random series can be seen below.
The decomposition series above belong to the time series with 7- and 30-day frequencies, respectively. Looking at the ACF plot of the series, the highest ACF value occurs at lag 16, so a time series decomposition with a 16-day frequency would be appropriate.
In this case, the random part of the time series decomposed with a 16-day frequency seems closest to a randomly distributed series with mean zero and standard deviation one, so it is chosen as the final decomposition.
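The chosen decomposition can be sketched as below. This is a hedged sketch: `sales` stands in for the product's sold_count series, and the 16-day frequency follows the choice made above.

```r
# Multiplicative decomposition at the selected 16-day frequency;
# 'sales' is an assumed name for the product's sold_count series.
ts16 <- ts(sales, frequency = 16)
dec  <- decompose(ts16, type = "multiplicative")
random16 <- dec$random  # the random (remainder) component inspected above
plot(dec)
```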
Looking at the ACF, 2, 5, or 8 may be selected for the ‘q’ value; looking at the PACF, 3 or 4 may be selected for the ‘p’ value. The auto.arima function is used as well. The AIC and BIC values of the candidate models can be seen below. The ARIMA(3,0,5) model selected by inspecting the ACF and PACF plots has a smaller AIC than the ARIMA(1,0,3) model suggested by auto.arima, so ARIMA(3,0,5) will be used for the performance comparison with the linear models.
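The comparison described above can be sketched as follows, assuming `random16` denotes the random component of the 16-day decomposition. Both `arima` (stats) and `auto.arima` (forecast package) handle the NA values at the edges of the decomposed series.

```r
library(forecast)  # for auto.arima

# Inspect ACF/PACF of the random component to pick candidate p and q
acf(random16, na.action = na.pass)
pacf(random16, na.action = na.pass)

# Hand-picked model vs. auto.arima suggestion, compared by AIC and BIC
manual <- arima(random16, order = c(3, 0, 5))
auto   <- auto.arima(random16)
AIC(manual); BIC(manual)
AIC(auto);   BIC(auto)
```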
Below, you can see the correlations between the attributes. According to this matrix, basket_count, favored_count, is_campaign, and category_sold can be added to the model in different combinations. Since the boxplots above show a monthly change in the data, month information can also be added to the candidate models.
The performance of the different linear regression and ARIMA models on the test dates will be computed, and the best model will be selected accordingly.
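The comparison metric used throughout this report, WMAPE, can be sketched as a small helper. The `actual` and `predicted` vectors below are hypothetical examples, not values from the data.

```r
# Weighted mean absolute percentage error: total absolute error
# divided by total absolute actuals.
wmape <- function(actual, predicted) {
  sum(abs(actual - predicted)) / sum(abs(actual))
}

# Hypothetical example vectors
actual    <- c(20, 18, 25, 22)
predicted <- c(19, 21, 24, 20)
wmape(actual, predicted)
```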
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6394.698
## ARIMA(0,0,0) with non-zero mean : 6220.27
## ARIMA(0,0,1) with zero mean : 6126.411
## ARIMA(0,0,1) with non-zero mean : 6009.282
## ARIMA(0,0,2) with zero mean : 6043.806
## ARIMA(0,0,2) with non-zero mean : 5957.195
## ARIMA(0,0,3) with zero mean : 5942.598
## ARIMA(0,0,3) with non-zero mean : 5884.053
## ARIMA(0,0,4) with zero mean : 5921.67
## ARIMA(0,0,4) with non-zero mean : 5877.716
## ARIMA(0,0,5) with zero mean : 5918.286
## ARIMA(0,0,5) with non-zero mean : 5879.596
## ARIMA(1,0,0) with zero mean : 5928.848
## ARIMA(1,0,0) with non-zero mean : 5911.463
## ARIMA(1,0,1) with zero mean : 5929.506
## ARIMA(1,0,1) with non-zero mean : 5909.434
## ARIMA(1,0,2) with zero mean : 5926.647
## ARIMA(1,0,2) with non-zero mean : 5903.617
## ARIMA(1,0,3) with zero mean : 5911.226
## ARIMA(1,0,3) with non-zero mean : 5877.901
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 5879.617
## ARIMA(2,0,0) with zero mean : 5929.216
## ARIMA(2,0,0) with non-zero mean : 5907.817
## ARIMA(2,0,1) with zero mean : 5930.771
## ARIMA(2,0,1) with non-zero mean : 5902.491
## ARIMA(2,0,2) with zero mean : 5925.483
## ARIMA(2,0,2) with non-zero mean : 5891.825
## ARIMA(2,0,3) with zero mean : 5911.948
## ARIMA(2,0,3) with non-zero mean : 5879.561
## ARIMA(3,0,0) with zero mean : 5928.15
## ARIMA(3,0,0) with non-zero mean : 5900.006
## ARIMA(3,0,1) with zero mean : 5930.061
## ARIMA(3,0,1) with non-zero mean : 5899.854
## ARIMA(3,0,2) with zero mean : 5933.983
## ARIMA(3,0,2) with non-zero mean : 5887.709
## ARIMA(4,0,0) with zero mean : 5929.24
## ARIMA(4,0,0) with non-zero mean : 5894.763
## ARIMA(4,0,1) with zero mean : 5925.197
## ARIMA(4,0,1) with non-zero mean : 5891.308
## ARIMA(5,0,0) with zero mean : 5906.657
## ARIMA(5,0,0) with non-zero mean : 5884.985
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
##
## ARIMA(0,0,0) with zero mean : 6394.698
## ARIMA(0,0,0) with non-zero mean : 6220.27
## ARIMA(0,0,0)(0,0,1)[16] with zero mean : 6294.138
## ARIMA(0,0,0)(0,0,1)[16] with non-zero mean : 6182.676
## ARIMA(0,0,0)(0,0,2)[16] with zero mean : 6267.205
## ARIMA(0,0,0)(0,0,2)[16] with non-zero mean : 6181.942
## ARIMA(0,0,0)(1,0,0)[16] with zero mean : 6247.732
## ARIMA(0,0,0)(1,0,0)[16] with non-zero mean : 6180.785
## ARIMA(0,0,0)(1,0,1)[16] with zero mean : 6236.257
## ARIMA(0,0,0)(1,0,1)[16] with non-zero mean : 6182.301
## ARIMA(0,0,0)(1,0,2)[16] with zero mean : Inf
## ARIMA(0,0,0)(1,0,2)[16] with non-zero mean : 6183.096
## ARIMA(0,0,0)(2,0,0)[16] with zero mean : 6244.526
## ARIMA(0,0,0)(2,0,0)[16] with non-zero mean : 6182.286
## ARIMA(0,0,0)(2,0,1)[16] with zero mean : Inf
## ARIMA(0,0,0)(2,0,1)[16] with non-zero mean : Inf
## ARIMA(0,0,0)(2,0,2)[16] with zero mean : Inf
## ARIMA(0,0,0)(2,0,2)[16] with non-zero mean : Inf
## ARIMA(0,0,1) with zero mean : 6126.411
## ARIMA(0,0,1) with non-zero mean : 6009.282
## ARIMA(0,0,1)(0,0,1)[16] with zero mean : 6069.038
## ARIMA(0,0,1)(0,0,1)[16] with non-zero mean : 5987.659
## ARIMA(0,0,1)(0,0,2)[16] with zero mean : 6052.357
## ARIMA(0,0,1)(0,0,2)[16] with non-zero mean : 5987.348
## ARIMA(0,0,1)(1,0,0)[16] with zero mean : 6043.194
## ARIMA(0,0,1)(1,0,0)[16] with non-zero mean : 5985.413
## ARIMA(0,0,1)(1,0,1)[16] with zero mean : 6028.021
## ARIMA(0,0,1)(1,0,1)[16] with non-zero mean : 5987.431
## ARIMA(0,0,1)(1,0,2)[16] with zero mean : Inf
## ARIMA(0,0,1)(1,0,2)[16] with non-zero mean : 5989.218
## ARIMA(0,0,1)(2,0,0)[16] with zero mean : 6037.264
## ARIMA(0,0,1)(2,0,0)[16] with non-zero mean : 5987.43
## ARIMA(0,0,1)(2,0,1)[16] with zero mean : Inf
## ARIMA(0,0,1)(2,0,1)[16] with non-zero mean : 5989.3
## ARIMA(0,0,1)(2,0,2)[16] with zero mean : Inf
## ARIMA(0,0,1)(2,0,2)[16] with non-zero mean : 5991.24
## ARIMA(0,0,2) with zero mean : 6043.806
## ARIMA(0,0,2) with non-zero mean : 5957.195
## ARIMA(0,0,2)(0,0,1)[16] with zero mean : 6001.241
## ARIMA(0,0,2)(0,0,1)[16] with non-zero mean : 5939.246
## ARIMA(0,0,2)(0,0,2)[16] with zero mean : 5992.86
## ARIMA(0,0,2)(0,0,2)[16] with non-zero mean : 5940.305
## ARIMA(0,0,2)(1,0,0)[16] with zero mean : 5986.603
## ARIMA(0,0,2)(1,0,0)[16] with non-zero mean : 5938.173
## ARIMA(0,0,2)(1,0,1)[16] with zero mean : 5977.252
## ARIMA(0,0,2)(1,0,1)[16] with non-zero mean : 5940.24
## ARIMA(0,0,2)(1,0,2)[16] with zero mean : Inf
## ARIMA(0,0,2)(1,0,2)[16] with non-zero mean : 5942.317
## ARIMA(0,0,2)(2,0,0)[16] with zero mean : 5983.547
## ARIMA(0,0,2)(2,0,0)[16] with non-zero mean : 5940.24
## ARIMA(0,0,2)(2,0,1)[16] with zero mean : Inf
## ARIMA(0,0,2)(2,0,1)[16] with non-zero mean : 5942.318
## ARIMA(0,0,3) with zero mean : 5942.598
## ARIMA(0,0,3) with non-zero mean : 5884.053
## ARIMA(0,0,3)(0,0,1)[16] with zero mean : 5917.294
## ARIMA(0,0,3)(0,0,1)[16] with non-zero mean : 5873.652
## ARIMA(0,0,3)(0,0,2)[16] with zero mean : 5915.625
## ARIMA(0,0,3)(0,0,2)[16] with non-zero mean : 5875.614
## ARIMA(0,0,3)(1,0,0)[16] with zero mean : 5911.389
## ARIMA(0,0,3)(1,0,0)[16] with non-zero mean : 5873.482
## ARIMA(0,0,3)(1,0,1)[16] with zero mean : 5904.144
## ARIMA(0,0,3)(1,0,1)[16] with non-zero mean : 5875.549
## ARIMA(0,0,3)(2,0,0)[16] with zero mean : 5910.439
## ARIMA(0,0,3)(2,0,0)[16] with non-zero mean : 5875.553
## ARIMA(0,0,4) with zero mean : 5921.67
## ARIMA(0,0,4) with non-zero mean : 5877.716
## ARIMA(0,0,4)(0,0,1)[16] with zero mean : 5902.592
## ARIMA(0,0,4)(0,0,1)[16] with non-zero mean : 5868.317
## ARIMA(0,0,4)(1,0,0)[16] with zero mean : 5898.323
## ARIMA(0,0,4)(1,0,0)[16] with non-zero mean : 5867.987
## ARIMA(0,0,5) with zero mean : 5918.286
## ARIMA(0,0,5) with non-zero mean : 5879.596
## ARIMA(1,0,0) with zero mean : 5928.848
## ARIMA(1,0,0) with non-zero mean : 5911.463
## ARIMA(1,0,0)(0,0,1)[16] with zero mean : 5913.662
## ARIMA(1,0,0)(0,0,1)[16] with non-zero mean : 5898.296
## ARIMA(1,0,0)(0,0,2)[16] with zero mean : 5915.174
## ARIMA(1,0,0)(0,0,2)[16] with non-zero mean : 5900.264
## ARIMA(1,0,0)(1,0,0)[16] with zero mean : 5912.762
## ARIMA(1,0,0)(1,0,0)[16] with non-zero mean : 5898.366
## ARIMA(1,0,0)(1,0,1)[16] with zero mean : 5914.729
## ARIMA(1,0,0)(1,0,1)[16] with non-zero mean : 5900.247
## ARIMA(1,0,0)(1,0,2)[16] with zero mean : 5915.576
## ARIMA(1,0,0)(1,0,2)[16] with non-zero mean : Inf
## ARIMA(1,0,0)(2,0,0)[16] with zero mean : 5914.766
## ARIMA(1,0,0)(2,0,0)[16] with non-zero mean : 5900.282
## ARIMA(1,0,0)(2,0,1)[16] with zero mean : Inf
## ARIMA(1,0,0)(2,0,1)[16] with non-zero mean : Inf
## ARIMA(1,0,0)(2,0,2)[16] with zero mean : Inf
## ARIMA(1,0,0)(2,0,2)[16] with non-zero mean : 5904.293
## ARIMA(1,0,1) with zero mean : 5929.506
## ARIMA(1,0,1) with non-zero mean : 5909.434
## ARIMA(1,0,1)(0,0,1)[16] with zero mean : 5914.708
## ARIMA(1,0,1)(0,0,1)[16] with non-zero mean : 5897.18
## ARIMA(1,0,1)(0,0,2)[16] with zero mean : 5915.989
## ARIMA(1,0,1)(0,0,2)[16] with non-zero mean : 5899.016
## ARIMA(1,0,1)(1,0,0)[16] with zero mean : 5913.493
## ARIMA(1,0,1)(1,0,0)[16] with non-zero mean : 5896.933
## ARIMA(1,0,1)(1,0,1)[16] with zero mean : 5915.217
## ARIMA(1,0,1)(1,0,1)[16] with non-zero mean : 5898.97
## ARIMA(1,0,1)(1,0,2)[16] with zero mean : 5915.94
## ARIMA(1,0,1)(1,0,2)[16] with non-zero mean : 5900.99
## ARIMA(1,0,1)(2,0,0)[16] with zero mean : 5915.382
## ARIMA(1,0,1)(2,0,0)[16] with non-zero mean : 5898.974
## ARIMA(1,0,1)(2,0,1)[16] with zero mean : Inf
## ARIMA(1,0,1)(2,0,1)[16] with non-zero mean : 5901.041
## ARIMA(1,0,2) with zero mean : 5926.647
## ARIMA(1,0,2) with non-zero mean : 5903.617
## ARIMA(1,0,2)(0,0,1)[16] with zero mean : 5912.013
## ARIMA(1,0,2)(0,0,1)[16] with non-zero mean : 5892.174
## ARIMA(1,0,2)(0,0,2)[16] with zero mean : 5913.573
## ARIMA(1,0,2)(0,0,2)[16] with non-zero mean : 5894.22
## ARIMA(1,0,2)(1,0,0)[16] with zero mean : 5910.984
## ARIMA(1,0,2)(1,0,0)[16] with non-zero mean : 5892.276
## ARIMA(1,0,2)(1,0,1)[16] with zero mean : 5912.652
## ARIMA(1,0,2)(1,0,1)[16] with non-zero mean : 5894.206
## ARIMA(1,0,2)(2,0,0)[16] with zero mean : 5912.904
## ARIMA(1,0,2)(2,0,0)[16] with non-zero mean : 5894.253
## ARIMA(1,0,3) with zero mean : 5911.226
## ARIMA(1,0,3) with non-zero mean : 5877.901
## ARIMA(1,0,3)(0,0,1)[16] with zero mean : 5896.345
## ARIMA(1,0,3)(0,0,1)[16] with non-zero mean : 5868.702
## ARIMA(1,0,3)(1,0,0)[16] with zero mean : 5893.882
## ARIMA(1,0,3)(1,0,0)[16] with non-zero mean : 5868.454
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 5879.617
## ARIMA(2,0,0) with zero mean : 5929.216
## ARIMA(2,0,0) with non-zero mean : 5907.817
## ARIMA(2,0,0)(0,0,1)[16] with zero mean : 5914.481
## ARIMA(2,0,0)(0,0,1)[16] with non-zero mean : 5895.919
## ARIMA(2,0,0)(0,0,2)[16] with zero mean : 5915.699
## ARIMA(2,0,0)(0,0,2)[16] with non-zero mean : 5897.689
## ARIMA(2,0,0)(1,0,0)[16] with zero mean : 5913.188
## ARIMA(2,0,0)(1,0,0)[16] with non-zero mean : 5895.566
## ARIMA(2,0,0)(1,0,1)[16] with zero mean : 5914.807
## ARIMA(2,0,0)(1,0,1)[16] with non-zero mean : 5897.626
## ARIMA(2,0,0)(1,0,2)[16] with zero mean : 5915.466
## ARIMA(2,0,0)(1,0,2)[16] with non-zero mean : 5899.655
## ARIMA(2,0,0)(2,0,0)[16] with zero mean : 5914.183
## ARIMA(2,0,0)(2,0,0)[16] with non-zero mean : 5897.627
## ARIMA(2,0,0)(2,0,1)[16] with zero mean : Inf
## ARIMA(2,0,0)(2,0,1)[16] with non-zero mean : 5899.699
## ARIMA(2,0,1) with zero mean : 5930.771
## ARIMA(2,0,1) with non-zero mean : 5902.491
## ARIMA(2,0,1)(0,0,1)[16] with zero mean : 5915.467
## ARIMA(2,0,1)(0,0,1)[16] with non-zero mean : 5890.756
## ARIMA(2,0,1)(0,0,2)[16] with zero mean : 5916.721
## ARIMA(2,0,1)(0,0,2)[16] with non-zero mean : 5892.439
## ARIMA(2,0,1)(1,0,0)[16] with zero mean : 5914.155
## ARIMA(2,0,1)(1,0,0)[16] with non-zero mean : 5890.223
## ARIMA(2,0,1)(1,0,1)[16] with zero mean : 5914.858
## ARIMA(2,0,1)(1,0,1)[16] with non-zero mean : 5892.293
## ARIMA(2,0,1)(2,0,0)[16] with zero mean : 5916.06
## ARIMA(2,0,1)(2,0,0)[16] with non-zero mean : 5892.295
## ARIMA(2,0,2) with zero mean : 5925.483
## ARIMA(2,0,2) with non-zero mean : 5891.825
## ARIMA(2,0,2)(0,0,1)[16] with zero mean : 5910.169
## ARIMA(2,0,2)(0,0,1)[16] with non-zero mean : 5880.613
## ARIMA(2,0,2)(1,0,0)[16] with zero mean : 5908.505
## ARIMA(2,0,2)(1,0,0)[16] with non-zero mean : 5880.4
## ARIMA(2,0,3) with zero mean : 5911.948
## ARIMA(2,0,3) with non-zero mean : 5879.561
## ARIMA(3,0,0) with zero mean : 5928.15
## ARIMA(3,0,0) with non-zero mean : 5900.006
## ARIMA(3,0,0)(0,0,1)[16] with zero mean : 5913.108
## ARIMA(3,0,0)(0,0,1)[16] with non-zero mean : 5888.623
## ARIMA(3,0,0)(0,0,2)[16] with zero mean : 5914.335
## ARIMA(3,0,0)(0,0,2)[16] with non-zero mean : 5890.524
## ARIMA(3,0,0)(1,0,0)[16] with zero mean : 5911.652
## ARIMA(3,0,0)(1,0,0)[16] with non-zero mean : 5888.37
## ARIMA(3,0,0)(1,0,1)[16] with zero mean : 5912.781
## ARIMA(3,0,0)(1,0,1)[16] with non-zero mean : 5890.44
## ARIMA(3,0,0)(2,0,0)[16] with zero mean : 5913.366
## ARIMA(3,0,0)(2,0,0)[16] with non-zero mean : 5890.443
## ARIMA(3,0,1) with zero mean : 5930.061
## ARIMA(3,0,1) with non-zero mean : 5899.854
## ARIMA(3,0,1)(0,0,1)[16] with zero mean : 5914.874
## ARIMA(3,0,1)(0,0,1)[16] with non-zero mean : 5888.16
## ARIMA(3,0,1)(1,0,0)[16] with zero mean : 5913.344
## ARIMA(3,0,1)(1,0,0)[16] with non-zero mean : 5887.858
## ARIMA(3,0,2) with zero mean : 5933.983
## ARIMA(3,0,2) with non-zero mean : 5887.709
## ARIMA(4,0,0) with zero mean : 5929.24
## ARIMA(4,0,0) with non-zero mean : 5894.763
## ARIMA(4,0,0)(0,0,1)[16] with zero mean : 5913.391
## ARIMA(4,0,0)(0,0,1)[16] with non-zero mean : 5882.648
## ARIMA(4,0,0)(1,0,0)[16] with zero mean : 5911.555
## ARIMA(4,0,0)(1,0,0)[16] with non-zero mean : 5882.25
## ARIMA(4,0,1) with zero mean : 5925.197
## ARIMA(4,0,1) with non-zero mean : 5891.308
## ARIMA(5,0,0) with zero mean : 5906.657
## ARIMA(5,0,0) with non-zero mean : 5884.985
##
##
##
## Best model: ARIMA(0,0,4)(1,0,0)[16] with non-zero mean
##
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6411.073
## ARIMA(0,0,0) with non-zero mean : 6236.361
## ARIMA(0,0,1) with zero mean : 6142.057
## ARIMA(0,0,1) with non-zero mean : 6024.666
## ARIMA(0,0,2) with zero mean : 6059.217
## ARIMA(0,0,2) with non-zero mean : 5972.392
## ARIMA(0,0,3) with zero mean : 5957.704
## ARIMA(0,0,3) with non-zero mean : 5899.022
## ARIMA(0,0,4) with zero mean : 5936.713
## ARIMA(0,0,4) with non-zero mean : 5892.644
## ARIMA(0,0,5) with zero mean : 5933.312
## ARIMA(0,0,5) with non-zero mean : 5894.52
## ARIMA(1,0,0) with zero mean : 5943.923
## ARIMA(1,0,0) with non-zero mean : 5926.474
## ARIMA(1,0,1) with zero mean : 5944.579
## ARIMA(1,0,1) with non-zero mean : 5924.436
## ARIMA(1,0,2) with zero mean : 5941.704
## ARIMA(1,0,2) with non-zero mean : 5918.603
## ARIMA(1,0,3) with zero mean : 5926.236
## ARIMA(1,0,3) with non-zero mean : 5892.825
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 5894.542
## ARIMA(2,0,0) with zero mean : 5944.289
## ARIMA(2,0,0) with non-zero mean : 5922.817
## ARIMA(2,0,1) with zero mean : 5945.842
## ARIMA(2,0,1) with non-zero mean : 5917.489
## ARIMA(2,0,2) with zero mean : 5940.532
## ARIMA(2,0,2) with non-zero mean : 5906.798
## ARIMA(2,0,3) with zero mean : 5926.952
## ARIMA(2,0,3) with non-zero mean : 5894.486
## ARIMA(3,0,0) with zero mean : 5943.214
## ARIMA(3,0,0) with non-zero mean : 5914.994
## ARIMA(3,0,1) with zero mean : 5945.125
## ARIMA(3,0,1) with non-zero mean : 5914.844
## ARIMA(3,0,2) with zero mean : 5949.048
## ARIMA(3,0,2) with non-zero mean : 5902.651
## ARIMA(4,0,0) with zero mean : 5944.301
## ARIMA(4,0,0) with non-zero mean : 5909.75
## ARIMA(4,0,1) with zero mean : 5940.242
## ARIMA(4,0,1) with non-zero mean : 5906.275
## ARIMA(5,0,0) with zero mean : 5921.646
## ARIMA(5,0,0) with non-zero mean : 5899.919
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6427.441
## ARIMA(0,0,0) with non-zero mean : 6252.459
## ARIMA(0,0,1) with zero mean : 6157.663
## ARIMA(0,0,1) with non-zero mean : 6040.13
## ARIMA(0,0,2) with zero mean : 6074.59
## ARIMA(0,0,2) with non-zero mean : 5987.647
## ARIMA(0,0,3) with zero mean : 5972.797
## ARIMA(0,0,3) with non-zero mean : 5914.014
## ARIMA(0,0,4) with zero mean : 5951.733
## ARIMA(0,0,4) with non-zero mean : 5907.603
## ARIMA(0,0,5) with zero mean : 5948.315
## ARIMA(0,0,5) with non-zero mean : 5909.476
## ARIMA(1,0,0) with zero mean : 5958.974
## ARIMA(1,0,0) with non-zero mean : 5941.511
## ARIMA(1,0,1) with zero mean : 5959.626
## ARIMA(1,0,1) with non-zero mean : 5939.472
## ARIMA(1,0,2) with zero mean : 5956.74
## ARIMA(1,0,2) with non-zero mean : 5933.614
## ARIMA(1,0,3) with zero mean : 5941.224
## ARIMA(1,0,3) with non-zero mean : 5907.777
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 5909.497
## ARIMA(2,0,0) with zero mean : 5959.336
## ARIMA(2,0,0) with non-zero mean : 5937.855
## ARIMA(2,0,1) with zero mean : 5960.886
## ARIMA(2,0,1) with non-zero mean : 5932.536
## ARIMA(2,0,2) with zero mean : 5955.56
## ARIMA(2,0,2) with non-zero mean : 5921.801
## ARIMA(2,0,3) with zero mean : 5941.936
## ARIMA(2,0,3) with non-zero mean : 5909.443
## ARIMA(3,0,0) with zero mean : 5958.254
## ARIMA(3,0,0) with non-zero mean : 5930.021
## ARIMA(3,0,1) with zero mean : 5960.163
## ARIMA(3,0,1) with non-zero mean : 5929.876
## ARIMA(3,0,2) with zero mean : 5964.043
## ARIMA(3,0,2) with non-zero mean : 5917.618
## ARIMA(4,0,0) with zero mean : 5959.338
## ARIMA(4,0,0) with non-zero mean : 5924.785
## ARIMA(4,0,1) with zero mean : 5955.262
## ARIMA(4,0,1) with non-zero mean : 5921.289
## ARIMA(5,0,0) with zero mean : 5936.613
## ARIMA(5,0,0) with non-zero mean : 5914.888
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6443.865
## ARIMA(0,0,0) with non-zero mean : 6268.435
## ARIMA(0,0,1) with zero mean : 6173.382
## ARIMA(0,0,1) with non-zero mean : 6055.43
## ARIMA(0,0,2) with zero mean : 6090.049
## ARIMA(0,0,2) with non-zero mean : 6002.788
## ARIMA(0,0,3) with zero mean : 5987.993
## ARIMA(0,0,3) with non-zero mean : 5928.924
## ARIMA(0,0,4) with zero mean : 5966.857
## ARIMA(0,0,4) with non-zero mean : 5922.489
## ARIMA(0,0,5) with zero mean : 5963.413
## ARIMA(0,0,5) with non-zero mean : 5924.36
## ARIMA(1,0,0) with zero mean : 5974.091
## ARIMA(1,0,0) with non-zero mean : 5956.506
## ARIMA(1,0,1) with zero mean : 5974.742
## ARIMA(1,0,1) with non-zero mean : 5954.456
## ARIMA(1,0,2) with zero mean : 5971.838
## ARIMA(1,0,2) with non-zero mean : 5948.576
## ARIMA(1,0,3) with zero mean : 5956.299
## ARIMA(1,0,3) with non-zero mean : 5922.662
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 5924.381
## ARIMA(2,0,0) with zero mean : 5974.451
## ARIMA(2,0,0) with non-zero mean : 5952.834
## ARIMA(2,0,1) with zero mean : 5976.002
## ARIMA(2,0,1) with non-zero mean : 5947.503
## ARIMA(2,0,2) with zero mean : 5970.652
## ARIMA(2,0,2) with non-zero mean : 5936.729
## ARIMA(2,0,3) with zero mean : 5957.005
## ARIMA(2,0,3) with non-zero mean : 5924.327
## ARIMA(3,0,0) with zero mean : 5973.359
## ARIMA(3,0,0) with non-zero mean : 5944.976
## ARIMA(3,0,1) with zero mean : 5975.269
## ARIMA(3,0,1) with non-zero mean : 5944.828
## ARIMA(3,0,2) with zero mean : 5979.166
## ARIMA(3,0,2) with non-zero mean : 5932.525
## ARIMA(4,0,0) with zero mean : 5974.444
## ARIMA(4,0,0) with non-zero mean : 5939.724
## ARIMA(4,0,1) with zero mean : 5970.356
## ARIMA(4,0,1) with non-zero mean : 5936.211
## ARIMA(5,0,0) with zero mean : 5951.655
## ARIMA(5,0,0) with non-zero mean : 5929.788
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6460.525
## ARIMA(0,0,0) with non-zero mean : 6284.275
## ARIMA(0,0,1) with zero mean : 6189.3
## ARIMA(0,0,1) with non-zero mean : 6070.695
## ARIMA(0,0,2) with zero mean : 6105.797
## ARIMA(0,0,2) with non-zero mean : 6017.938
## ARIMA(0,0,3) with zero mean : 6003.444
## ARIMA(0,0,3) with non-zero mean : 5943.901
## ARIMA(0,0,4) with zero mean : 5982.251
## ARIMA(0,0,4) with non-zero mean : 5937.465
## ARIMA(0,0,5) with zero mean : 5978.782
## ARIMA(0,0,5) with non-zero mean : 5939.338
## ARIMA(1,0,0) with zero mean : 5989.48
## ARIMA(1,0,0) with non-zero mean : 5971.634
## ARIMA(1,0,1) with zero mean : 5990.12
## ARIMA(1,0,1) with non-zero mean : 5969.553
## ARIMA(1,0,2) with zero mean : 5987.224
## ARIMA(1,0,2) with non-zero mean : 5963.659
## ARIMA(1,0,3) with zero mean : 5971.637
## ARIMA(1,0,3) with non-zero mean : 5937.644
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 5939.375
## ARIMA(2,0,0) with zero mean : 5989.827
## ARIMA(2,0,0) with non-zero mean : 5967.918
## ARIMA(2,0,1) with zero mean : 5991.371
## ARIMA(2,0,1) with non-zero mean : 5962.536
## ARIMA(2,0,2) with zero mean : 5986.037
## ARIMA(2,0,2) with non-zero mean : 5951.741
## ARIMA(2,0,3) with zero mean : 5972.331
## ARIMA(2,0,3) with non-zero mean : 5939.304
## ARIMA(3,0,0) with zero mean : 5988.74
## ARIMA(3,0,0) with non-zero mean : 5960.018
## ARIMA(3,0,1) with zero mean : 5990.65
## ARIMA(3,0,1) with non-zero mean : 5959.852
## ARIMA(3,0,2) with zero mean : 5994.578
## ARIMA(3,0,2) with non-zero mean : 5947.546
## ARIMA(4,0,0) with zero mean : 5989.823
## ARIMA(4,0,0) with non-zero mean : 5954.721
## ARIMA(4,0,1) with zero mean : 5985.709
## ARIMA(4,0,1) with non-zero mean : 5951.193
## ARIMA(5,0,0) with zero mean : 5966.948
## ARIMA(5,0,0) with non-zero mean : 5944.777
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6477.126
## ARIMA(0,0,0) with non-zero mean : 6300.121
## ARIMA(0,0,1) with zero mean : 6205.004
## ARIMA(0,0,1) with non-zero mean : 6085.987
## ARIMA(0,0,2) with zero mean : 6121.194
## ARIMA(0,0,2) with non-zero mean : 6033.099
## ARIMA(0,0,3) with zero mean : 6018.542
## ARIMA(0,0,3) with non-zero mean : 5958.844
## ARIMA(0,0,4) with zero mean : 5997.263
## ARIMA(0,0,4) with non-zero mean : 5952.394
## ARIMA(0,0,5) with zero mean : 5993.777
## ARIMA(0,0,5) with non-zero mean : 5954.265
## ARIMA(1,0,0) with zero mean : 6004.527
## ARIMA(1,0,0) with non-zero mean : 5986.636
## ARIMA(1,0,1) with zero mean : 6005.161
## ARIMA(1,0,1) with non-zero mean : 5984.556
## ARIMA(1,0,2) with zero mean : 6002.251
## ARIMA(1,0,2) with non-zero mean : 5978.644
## ARIMA(1,0,3) with zero mean : 5986.615
## ARIMA(1,0,3) with non-zero mean : 5952.568
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 5954.287
## ARIMA(2,0,0) with zero mean : 6004.868
## ARIMA(2,0,0) with non-zero mean : 5982.924
## ARIMA(2,0,1) with zero mean : 6006.412
## ARIMA(2,0,1) with non-zero mean : 5977.545
## ARIMA(2,0,2) with zero mean : 6001.055
## ARIMA(2,0,2) with non-zero mean : 5966.715
## ARIMA(2,0,3) with zero mean : 5987.305
## ARIMA(2,0,3) with non-zero mean : 5954.232
## ARIMA(3,0,0) with zero mean : 6003.771
## ARIMA(3,0,0) with non-zero mean : 5975.018
## ARIMA(3,0,1) with zero mean : 6005.681
## ARIMA(3,0,1) with non-zero mean : 5974.852
## ARIMA(3,0,2) with zero mean : 6009.584
## ARIMA(3,0,2) with non-zero mean : 5962.49
## ARIMA(4,0,0) with zero mean : 6004.851
## ARIMA(4,0,0) with non-zero mean : 5969.711
## ARIMA(4,0,1) with zero mean : 6000.722
## ARIMA(4,0,1) with non-zero mean : 5966.162
## ARIMA(5,0,0) with zero mean : 5981.909
## ARIMA(5,0,0) with non-zero mean : 5959.712
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6493.708
## ARIMA(0,0,0) with non-zero mean : 6315.97
## ARIMA(0,0,1) with zero mean : 6220.824
## ARIMA(0,0,1) with non-zero mean : 6101.242
## ARIMA(0,0,2) with zero mean : 6136.698
## ARIMA(0,0,2) with non-zero mean : 6048.213
## ARIMA(0,0,3) with zero mean : 6033.648
## ARIMA(0,0,3) with non-zero mean : 5973.783
## ARIMA(0,0,4) with zero mean : 6012.288
## ARIMA(0,0,4) with non-zero mean : 5967.309
## ARIMA(0,0,5) with zero mean : 6008.774
## ARIMA(0,0,5) with non-zero mean : 5969.181
## ARIMA(1,0,0) with zero mean : 6019.581
## ARIMA(1,0,0) with non-zero mean : 6001.628
## ARIMA(1,0,1) with zero mean : 6020.215
## ARIMA(1,0,1) with non-zero mean : 5999.536
## ARIMA(1,0,2) with zero mean : 6017.279
## ARIMA(1,0,2) with non-zero mean : 5993.619
## ARIMA(1,0,3) with zero mean : 6001.594
## ARIMA(1,0,3) with non-zero mean : 5967.485
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 5969.202
## ARIMA(2,0,0) with zero mean : 6019.921
## ARIMA(2,0,0) with non-zero mean : 5997.9
## ARIMA(2,0,1) with zero mean : 6021.464
## ARIMA(2,0,1) with non-zero mean : 5992.514
## ARIMA(2,0,2) with zero mean : 6016.071
## ARIMA(2,0,2) with non-zero mean : 5981.685
## ARIMA(2,0,3) with zero mean : 6002.278
## ARIMA(2,0,3) with non-zero mean : 5969.148
## ARIMA(3,0,0) with zero mean : 6018.809
## ARIMA(3,0,0) with non-zero mean : 5989.986
## ARIMA(3,0,1) with zero mean : 6020.718
## ARIMA(3,0,1) with non-zero mean : 5989.823
## ARIMA(3,0,2) with zero mean : 6024.638
## ARIMA(3,0,2) with non-zero mean : 5977.429
## ARIMA(4,0,0) with zero mean : 6019.886
## ARIMA(4,0,0) with non-zero mean : 5984.677
## ARIMA(4,0,1) with zero mean : 6015.737
## ARIMA(4,0,1) with non-zero mean : 5981.114
## ARIMA(5,0,0) with zero mean : 5996.869
## ARIMA(5,0,0) with non-zero mean : 5974.636
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6510.13
## ARIMA(0,0,0) with non-zero mean : 6331.928
## ARIMA(0,0,1) with zero mean : 6236.414
## ARIMA(0,0,1) with non-zero mean : 6116.696
## ARIMA(0,0,2) with zero mean : 6152.059
## ARIMA(0,0,2) with non-zero mean : 6063.454
## ARIMA(0,0,3) with zero mean : 6048.71
## ARIMA(0,0,3) with non-zero mean : 5988.861
## ARIMA(0,0,4) with zero mean : 6027.292
## ARIMA(0,0,4) with non-zero mean : 5982.374
## ARIMA(0,0,5) with zero mean : 6023.767
## ARIMA(0,0,5) with non-zero mean : 5984.244
## ARIMA(1,0,0) with zero mean : 6034.656
## ARIMA(1,0,0) with non-zero mean : 6016.774
## ARIMA(1,0,1) with zero mean : 6035.284
## ARIMA(1,0,1) with non-zero mean : 6014.674
## ARIMA(1,0,2) with zero mean : 6032.32
## ARIMA(1,0,2) with non-zero mean : 6008.712
## ARIMA(1,0,3) with zero mean : 6016.585
## ARIMA(1,0,3) with non-zero mean : 5982.549
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 5984.266
## ARIMA(2,0,0) with zero mean : 6034.988
## ARIMA(2,0,0) with non-zero mean : 6013.033
## ARIMA(2,0,1) with zero mean : 6036.536
## ARIMA(2,0,1) with non-zero mean : 6007.66
## ARIMA(2,0,2) with zero mean : 6031.1
## ARIMA(2,0,2) with non-zero mean : 5996.766
## ARIMA(2,0,3) with zero mean : 6017.27
## ARIMA(2,0,3) with non-zero mean : 5984.212
## ARIMA(3,0,0) with zero mean : 6033.859
## ARIMA(3,0,0) with non-zero mean : 6005.092
## ARIMA(3,0,1) with zero mean : 6035.767
## ARIMA(3,0,1) with non-zero mean : 6004.945
## ARIMA(3,0,2) with zero mean : 6039.674
## ARIMA(3,0,2) with non-zero mean : 5992.479
## ARIMA(4,0,0) with zero mean : 6034.938
## ARIMA(4,0,0) with non-zero mean : 5999.836
## ARIMA(4,0,1) with zero mean : 6030.776
## ARIMA(4,0,1) with non-zero mean : 5996.248
## ARIMA(5,0,0) with zero mean : 6011.85
## ARIMA(5,0,0) with non-zero mean : 5989.709
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6526.45
## ARIMA(0,0,0) with non-zero mean : 6348.171
## ARIMA(0,0,1) with zero mean : 6251.991
## ARIMA(0,0,1) with non-zero mean : 6132.262
## ARIMA(0,0,2) with zero mean : 6167.41
## ARIMA(0,0,2) with non-zero mean : 6078.952
## ARIMA(0,0,3) with zero mean : 6063.771
## ARIMA(0,0,3) with non-zero mean : 6003.996
## ARIMA(0,0,4) with zero mean : 6042.292
## ARIMA(0,0,4) with non-zero mean : 5997.459
## ARIMA(0,0,5) with zero mean : 6038.756
## ARIMA(0,0,5) with non-zero mean : 5999.328
## ARIMA(1,0,0) with zero mean : 6049.799
## ARIMA(1,0,0) with non-zero mean : 6032.081
## ARIMA(1,0,1) with zero mean : 6050.408
## ARIMA(1,0,1) with non-zero mean : 6029.95
## ARIMA(1,0,2) with zero mean : 6047.422
## ARIMA(1,0,2) with non-zero mean : 6023.978
## ARIMA(1,0,3) with zero mean : 6031.571
## ARIMA(1,0,3) with non-zero mean : 5997.635
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 5999.352
## ARIMA(2,0,0) with zero mean : 6050.107
## ARIMA(2,0,0) with non-zero mean : 6028.299
## ARIMA(2,0,1) with zero mean : 6051.657
## ARIMA(2,0,1) with non-zero mean : 6022.938
## ARIMA(2,0,2) with zero mean : 6046.168
## ARIMA(2,0,2) with non-zero mean : 6011.962
## ARIMA(2,0,3) with zero mean : 6032.258
## ARIMA(2,0,3) with non-zero mean : 5999.295
## ARIMA(3,0,0) with zero mean : 6048.962
## ARIMA(3,0,0) with non-zero mean : 6020.35
## ARIMA(3,0,1) with zero mean : 6050.869
## ARIMA(3,0,1) with non-zero mean : 6020.21
## ARIMA(3,0,2) with zero mean : 6054.801
## ARIMA(3,0,2) with non-zero mean : 6007.615
## ARIMA(4,0,0) with zero mean : 6050.032
## ARIMA(4,0,0) with non-zero mean : 6015.083
## ARIMA(4,0,1) with zero mean : 6045.838
## ARIMA(4,0,1) with non-zero mean : 6011.443
## ARIMA(5,0,0) with zero mean : 6026.84
## ARIMA(5,0,0) with non-zero mean : 6004.816
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6542.781
## ARIMA(0,0,0) with non-zero mean : 6364.329
## ARIMA(0,0,1) with zero mean : 6267.595
## ARIMA(0,0,1) with non-zero mean : 6147.668
## ARIMA(0,0,2) with zero mean : 6182.832
## ARIMA(0,0,2) with non-zero mean : 6094.112
## ARIMA(0,0,3) with zero mean : 6078.877
## ARIMA(0,0,3) with non-zero mean : 6018.926
## ARIMA(0,0,4) with zero mean : 6057.375
## ARIMA(0,0,4) with non-zero mean : 6012.338
## ARIMA(0,0,5) with zero mean : 6053.829
## ARIMA(0,0,5) with non-zero mean : 6014.205
## ARIMA(1,0,0) with zero mean : 6064.848
## ARIMA(1,0,0) with non-zero mean : 6047.08
## ARIMA(1,0,1) with zero mean : 6065.46
## ARIMA(1,0,1) with non-zero mean : 6044.934
## ARIMA(1,0,2) with zero mean : 6062.476
## ARIMA(1,0,2) with non-zero mean : 6038.936
## ARIMA(1,0,3) with zero mean : 6046.62
## ARIMA(1,0,3) with non-zero mean : 6012.513
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 6014.226
## ARIMA(2,0,0) with zero mean : 6065.161
## ARIMA(2,0,0) with non-zero mean : 6043.277
## ARIMA(2,0,1) with zero mean : 6066.7
## ARIMA(2,0,1) with non-zero mean : 6037.888
## ARIMA(2,0,2) with zero mean : 6061.234
## ARIMA(2,0,2) with non-zero mean : 6026.877
## ARIMA(2,0,3) with zero mean : 6047.286
## ARIMA(2,0,3) with non-zero mean : 6014.172
## ARIMA(3,0,0) with zero mean : 6064.021
## ARIMA(3,0,0) with non-zero mean : 6035.296
## ARIMA(3,0,1) with zero mean : 6065.928
## ARIMA(3,0,1) with non-zero mean : 6035.151
## ARIMA(3,0,2) with zero mean : 6069.845
## ARIMA(3,0,2) with non-zero mean : 6022.51
## ARIMA(4,0,0) with zero mean : 6065.091
## ARIMA(4,0,0) with non-zero mean : 6030.014
## ARIMA(4,0,1) with zero mean : 6060.875
## ARIMA(4,0,1) with non-zero mean : 6026.361
## ARIMA(5,0,0) with zero mean : 6041.812
## ARIMA(5,0,0) with non-zero mean : 6019.71
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6559.094
## ARIMA(0,0,0) with non-zero mean : 6380.584
## ARIMA(0,0,1) with zero mean : 6283.162
## ARIMA(0,0,1) with non-zero mean : 6163.289
## ARIMA(0,0,2) with zero mean : 6198.17
## ARIMA(0,0,2) with non-zero mean : 6109.503
## ARIMA(0,0,3) with zero mean : 6093.933
## ARIMA(0,0,3) with non-zero mean : 6033.971
## ARIMA(0,0,4) with zero mean : 6072.368
## ARIMA(0,0,4) with non-zero mean : 6027.342
## ARIMA(0,0,5) with zero mean : 6068.806
## ARIMA(0,0,5) with non-zero mean : 6029.196
## ARIMA(1,0,0) with zero mean : 6079.883
## ARIMA(1,0,0) with non-zero mean : 6062.171
## ARIMA(1,0,1) with zero mean : 6080.491
## ARIMA(1,0,1) with non-zero mean : 6060.036
## ARIMA(1,0,2) with zero mean : 6077.488
## ARIMA(1,0,2) with non-zero mean : 6053.981
## ARIMA(1,0,3) with zero mean : 6061.583
## ARIMA(1,0,3) with non-zero mean : 6027.496
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 6029.218
## ARIMA(2,0,0) with zero mean : 6080.191
## ARIMA(2,0,0) with non-zero mean : 6058.382
## ARIMA(2,0,1) with zero mean : 6081.729
## ARIMA(2,0,1) with non-zero mean : 6052.977
## ARIMA(2,0,2) with zero mean : 6076.237
## ARIMA(2,0,2) with non-zero mean : 6041.887
## ARIMA(2,0,3) with zero mean : 6062.246
## ARIMA(2,0,3) with non-zero mean : 6029.165
## ARIMA(3,0,0) with zero mean : 6079.038
## ARIMA(3,0,0) with non-zero mean : 6050.357
## ARIMA(3,0,1) with zero mean : 6080.944
## ARIMA(3,0,1) with non-zero mean : 6050.203
## ARIMA(3,0,2) with zero mean : 6085.093
## ARIMA(3,0,2) with non-zero mean : 6037.523
## ARIMA(4,0,0) with zero mean : 6080.105
## ARIMA(4,0,0) with non-zero mean : 6045.045
## ARIMA(4,0,1) with zero mean : 6075.871
## ARIMA(4,0,1) with non-zero mean : 6041.364
## ARIMA(5,0,0) with zero mean : 6056.76
## ARIMA(5,0,0) with non-zero mean : 6034.691
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6575.409
## ARIMA(0,0,0) with non-zero mean : 6396.792
## ARIMA(0,0,1) with zero mean : 6298.758
## ARIMA(0,0,1) with non-zero mean : 6178.712
## ARIMA(0,0,2) with zero mean : 6213.517
## ARIMA(0,0,2) with non-zero mean : 6124.772
## ARIMA(0,0,3) with zero mean : 6108.991
## ARIMA(0,0,3) with non-zero mean : 6048.978
## ARIMA(0,0,4) with zero mean : 6087.362
## ARIMA(0,0,4) with non-zero mean : 6042.298
## ARIMA(0,0,5) with zero mean : 6083.782
## ARIMA(0,0,5) with non-zero mean : 6044.148
## ARIMA(1,0,0) with zero mean : 6094.915
## ARIMA(1,0,0) with non-zero mean : 6077.183
## ARIMA(1,0,1) with zero mean : 6095.522
## ARIMA(1,0,1) with non-zero mean : 6075.039
## ARIMA(1,0,2) with zero mean : 6092.5
## ARIMA(1,0,2) with non-zero mean : 6068.988
## ARIMA(1,0,3) with zero mean : 6076.547
## ARIMA(1,0,3) with non-zero mean : 6042.444
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 6044.17
## ARIMA(2,0,0) with zero mean : 6095.221
## ARIMA(2,0,0) with non-zero mean : 6073.385
## ARIMA(2,0,1) with zero mean : 6096.766
## ARIMA(2,0,1) with non-zero mean : 6067.976
## ARIMA(2,0,2) with zero mean : 6091.241
## ARIMA(2,0,2) with non-zero mean : 6056.901
## ARIMA(2,0,3) with zero mean : 6077.207
## ARIMA(2,0,3) with non-zero mean : 6044.12
## ARIMA(3,0,0) with zero mean : 6094.059
## ARIMA(3,0,0) with non-zero mean : 6065.363
## ARIMA(3,0,1) with zero mean : 6095.965
## ARIMA(3,0,1) with non-zero mean : 6065.203
## ARIMA(3,0,2) with zero mean : 6099.879
## ARIMA(3,0,2) with non-zero mean : 6052.519
## ARIMA(4,0,0) with zero mean : 6095.127
## ARIMA(4,0,0) with non-zero mean : 6060.024
## ARIMA(4,0,1) with zero mean : 6090.876
## ARIMA(4,0,1) with non-zero mean : 6056.334
## ARIMA(5,0,0) with zero mean : 6071.702
## ARIMA(5,0,0) with non-zero mean : 6049.647
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6591.854
## ARIMA(0,0,0) with non-zero mean : 6412.7
## ARIMA(0,0,1) with zero mean : 6314.502
## ARIMA(0,0,1) with non-zero mean : 6193.966
## ARIMA(0,0,2) with zero mean : 6229.147
## ARIMA(0,0,2) with non-zero mean : 6139.871
## ARIMA(0,0,3) with zero mean : 6124.307
## ARIMA(0,0,3) with non-zero mean : 6063.881
## ARIMA(0,0,4) with zero mean : 6102.601
## ARIMA(0,0,4) with non-zero mean : 6057.187
## ARIMA(0,0,5) with zero mean : 6098.999
## ARIMA(0,0,5) with non-zero mean : 6059.039
## ARIMA(1,0,0) with zero mean : 6110.213
## ARIMA(1,0,0) with non-zero mean : 6092.229
## ARIMA(1,0,1) with zero mean : 6110.815
## ARIMA(1,0,1) with non-zero mean : 6090.062
## ARIMA(1,0,2) with zero mean : 6107.8
## ARIMA(1,0,2) with non-zero mean : 6083.995
## ARIMA(1,0,3) with zero mean : 6091.745
## ARIMA(1,0,3) with non-zero mean : 6057.339
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 6059.06
## ARIMA(2,0,0) with zero mean : 6110.515
## ARIMA(2,0,0) with non-zero mean : 6088.399
## ARIMA(2,0,1) with zero mean : 6112.041
## ARIMA(2,0,1) with non-zero mean : 6082.957
## ARIMA(2,0,2) with zero mean : 6106.526
## ARIMA(2,0,2) with non-zero mean : 6071.829
## ARIMA(2,0,3) with zero mean : 6092.405
## ARIMA(2,0,3) with non-zero mean : 6059.009
## ARIMA(3,0,0) with zero mean : 6109.361
## ARIMA(3,0,0) with non-zero mean : 6080.34
## ARIMA(3,0,1) with zero mean : 6111.267
## ARIMA(3,0,1) with non-zero mean : 6080.166
## ARIMA(3,0,2) with zero mean : 6115.199
## ARIMA(3,0,2) with non-zero mean : 6067.443
## ARIMA(4,0,0) with zero mean : 6110.424
## ARIMA(4,0,0) with non-zero mean : 6074.96
## ARIMA(4,0,1) with zero mean : 6106.144
## ARIMA(4,0,1) with non-zero mean : 6071.252
## ARIMA(5,0,0) with zero mean : 6086.83
## ARIMA(5,0,0) with non-zero mean : 6064.546
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## [1] "input_series=data$sold_count"
##
## ARIMA(0,0,0) with zero mean : 6608.296
## ARIMA(0,0,0) with non-zero mean : 6428.606
## ARIMA(0,0,1) with zero mean : 6330.121
## ARIMA(0,0,1) with non-zero mean : 6209.306
## ARIMA(0,0,2) with zero mean : 6244.493
## ARIMA(0,0,2) with non-zero mean : 6155.067
## ARIMA(0,0,3) with zero mean : 6139.405
## ARIMA(0,0,3) with non-zero mean : 6078.792
## ARIMA(0,0,4) with zero mean : 6117.614
## ARIMA(0,0,4) with non-zero mean : 6072.074
## ARIMA(0,0,5) with zero mean : 6113.99
## ARIMA(0,0,5) with non-zero mean : 6073.924
## ARIMA(1,0,0) with zero mean : 6125.247
## ARIMA(1,0,0) with non-zero mean : 6107.211
## ARIMA(1,0,1) with zero mean : 6125.841
## ARIMA(1,0,1) with non-zero mean : 6105.048
## ARIMA(1,0,2) with zero mean : 6122.815
## ARIMA(1,0,2) with non-zero mean : 6098.955
## ARIMA(1,0,3) with zero mean : 6106.716
## ARIMA(1,0,3) with non-zero mean : 6072.223
## ARIMA(1,0,4) with zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 6073.946
## ARIMA(2,0,0) with zero mean : 6125.539
## ARIMA(2,0,0) with non-zero mean : 6103.388
## ARIMA(2,0,1) with zero mean : 6127.066
## ARIMA(2,0,1) with non-zero mean : 6097.953
## ARIMA(2,0,2) with zero mean : 6121.532
## ARIMA(2,0,2) with non-zero mean : 6086.778
## ARIMA(2,0,3) with zero mean : 6107.374
## ARIMA(2,0,3) with non-zero mean : 6073.896
## ARIMA(3,0,0) with zero mean : 6124.377
## ARIMA(3,0,0) with non-zero mean : 6095.322
## ARIMA(3,0,1) with zero mean : 6126.283
## ARIMA(3,0,1) with non-zero mean : 6095.148
## ARIMA(3,0,2) with zero mean : 6130.203
## ARIMA(3,0,2) with non-zero mean : 6082.345
## ARIMA(4,0,0) with zero mean : 6125.44
## ARIMA(4,0,0) with non-zero mean : 6089.931
## ARIMA(4,0,1) with zero mean : 6121.146
## ARIMA(4,0,1) with non-zero mean : 6086.197
## ARIMA(5,0,0) with zero mean : 6101.778
## ARIMA(5,0,0) with non-zero mean : 6079.455
##
##
##
## Best model: ARIMA(0,0,4) with non-zero mean
##
## [1] "input_series=ts(data$sold_count,freq=16)"
## variable n mean sd CV FBias MAPE
## 1: lm_prediction2 14 412.4286 232.3915 0.5634709 -4.6142064 5.1883257
## 2: lm_prediction3 14 412.4286 232.3915 0.5634709 -4.7166790 5.3069999
## 3: lm_prediction4 14 412.4286 232.3915 0.5634709 -4.5580423 5.0437342
## 4: lm_prediction5 14 412.4286 232.3915 0.5634709 0.1458778 0.6688978
## 5: lm_prediction6 14 412.4286 232.3915 0.5634709 -4.4771036 4.9528052
## 6: arima_prediction 14 412.4286 232.3915 0.5634709 -0.3479633 0.7460511
## 7: sarima_prediction 14 412.4286 232.3915 0.5634709 -0.2817767 0.6716086
## 8: selected_arima 14 412.4286 232.3915 0.5634709 0.1967600 0.6791326
## RMSE MAD MADP WMAPE
## 1: 2184.1016 1903.0305 4.6142064 4.6142064
## 2: 2238.7716 1945.2932 4.7166790 4.7166790
## 3: 2165.7375 1879.8669 4.5580423 4.5580423
## 4: 303.9193 245.5246 0.5953143 0.5953143
## 5: 2125.5963 1846.4854 4.4771036 4.4771036
## 6: 221.0945 188.7099 0.4575578 0.4575578
## 7: 203.6424 168.6188 0.4088437 0.4088437
## 8: 276.4342 230.1284 0.5579837 0.5579837
The smallest weighted mean absolute percentage error is obtained for the ARIMA(0,0,4) model built on the 16-day frequency decomposition, which is also the model that auto.arima suggested. Therefore, this model is selected for further predictions.
To conclude, below is a plot of the actual test set together with the predicted values of the chosen model. As can be seen, the predictions are not far off.
With the selected model, a one-day-ahead prediction can be performed using all the data on hand, since a one-day-ahead prediction must be submitted in this competition.
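As a hedged sketch of this one-day-ahead step (using simulated stand-ins for the detrended series and the campaign dummy, not the report's data), an ARIMA(0,0,4) model with is_campaign as an external regressor can be fit and forecast one step ahead with base R:

```r
# Simulated stand-ins; the real series is the detrended sold_count with freq 16
set.seed(1)
detrend2    <- ts(rnorm(200, mean = 1), frequency = 16)
is_campaign <- rbinom(200, 1, 0.1)   # 0/1 campaign dummy (assumed shape)

# Same order and regressor structure as in the output below
fit <- arima(detrend2, order = c(0, 0, 4), xreg = is_campaign)

# One-step-ahead prediction, assuming no campaign on the target day
pred <- predict(fit, n.ahead = 1, newxreg = 0)
pred$pred  # this random-component forecast is then recombined with trend/seasonality
```

The predicted random component is added back to the trend and seasonal components to obtain the final sold-count forecast.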
## price event_date product_content_id sold_count visit_count favored_count
## 1: 51.77 2021-07-02 31515569 267 6757 345
## basket_count category_sold category_brand_sold category_visits ty_visits
## 1: 1075 6486 887 383610 99819109
## category_basket category_favored w_day mon is_campaign
## 1: 34514 33905 6 7 0
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.4811
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.0231
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
##
## Call:
## arima(x = detrend2, order = c(0, 0, 4), xreg = data_31515569$is_campaign, include.mean = TRUE)
##
## Coefficients:
## ma1 ma2 ma3 ma4 intercept data_31515569$is_campaign
## 0.7895 0.5018 0.2625 -0.0056 0.9159 0.6703
## s.e. 0.0537 0.0718 0.0721 0.0583 0.0541 0.1014
##
## sigma^2 estimated as 0.1694: log likelihood = -206.46, aic = 426.92
## [1] 426.9219
## [1] 454.649
## Time Series:
## Start = c(26, 5)
## End = c(26, 5)
## Frequency = 16
## [1] 257.4225
## price event_date product_content_id sold_count visit_count favored_count
## 1: 51.77 2021-07-04 31515569 267 6757 345
## basket_count category_sold category_brand_sold category_visits ty_visits
## 1: 1075 6486 887 383610 99819109
## category_basket category_favored w_day mon is_campaign arima1_prediction
## 1: 34514 33905 6 7 0 257.4225
Before building forecasting models for Product 6, the data should be plotted and its seasonality and trend examined. Below is the plot of the sales quantity of Product 6. Missing sold-count values are filled with the mean of the series. There is a slightly increasing trend, especially at the beginning and end of the plot, but no significant seasonality is visible. To look further, three months of 2021 (March, April, and May) are plotted. Again, the seasonality is not significant; it can be concluded that there is no seasonality.
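The mean imputation described above can be sketched in a few lines of base R; `sold` here is a toy stand-in data frame, with column names following the report:

```r
# Toy data frame with two missing sold_count values
sold <- data.frame(sold_count = c(10, NA, 30, 25, NA, 40))

# Replace the NAs with the mean of the observed values
sold$sold_count[is.na(sold$sold_count)] <- mean(sold$sold_count, na.rm = TRUE)
sold$sold_count  # the two gaps are now filled with 26.25
```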
The first type of model to be built is a linear regression model. It is wise to start by selecting helpful attributes from the correlation matrix. Below are the correlations between the attributes; according to this matrix, only basket_count should be added to the model.
In the first model, this attribute is added. The adjusted R-squared value indicates how well the model fits, and for the first model it is fairly high, which is a good sign. However, there are outliers, probably caused by campaigns and holidays, which can be eliminated for a better model. Lastly, a 'lag1' attribute is added because lag 1 is very high in the ACF. In the final linear regression model, the adjusted R-squared value is high enough and the residual plots look good enough to make predictions.
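A sketch of the feature construction implied here, on simulated data: a lag-1 column plus 0/1 outlier dummies fed into lm(). The threshold (mean ± 2 sd) is an assumption for illustration, not necessarily the report's exact rule:

```r
set.seed(2)
sold <- data.frame(sold_count   = round(rnorm(100, 30, 8)),
                   basket_count = round(rnorm(100, 200, 50)))

# Outlier dummies: 1 when sales fall outside mean +/- 2 sd (assumed rule)
m <- mean(sold$sold_count); s <- sd(sold$sold_count)
sold$big_outlier   <- as.integer(sold$sold_count > m + 2 * s)
sold$small_outlier <- as.integer(sold$sold_count < m - 2 * s)

# lag1: the previous day's sales (first row has no predecessor)
sold$lag1 <- c(NA, head(sold$sold_count, -1))

fit <- lm(sold_count ~ lag1 + big_outlier + small_outlier + basket_count,
          data = sold)
summary(fit)$adj.r.squared
```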
##
## Call:
## lm(formula = sold_count ~ basket_count, data = sold)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.072 -1.754 1.148 1.148 22.764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.999774 0.936931 10.67 <2e-16 ***
## basket_count 0.126031 0.005347 23.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.752 on 367 degrees of freedom
## Multiple R-squared: 0.6022, Adjusted R-squared: 0.6011
## F-statistic: 555.6 on 1 and 367 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 66.191, df = 10, p-value = 2.398e-10
## sold_count
## Min. : 1.00
## 1st Qu.:32.00
## Median :32.89
## Mean :30.47
## 3rd Qu.:32.89
## Max. :81.00
##
## Call:
## lm(formula = sold_count ~ big_outlier + small_outlier + basket_count,
## data = sold)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3275 -0.3587 -0.3587 -0.3587 18.0575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.95803 0.75528 29.07 <2e-16 ***
## big_outlier 8.24784 0.77985 10.58 <2e-16 ***
## small_outlier -13.21226 0.58250 -22.68 <2e-16 ***
## basket_count 0.06545 0.00424 15.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.156 on 365 degrees of freedom
## Multiple R-squared: 0.8501, Adjusted R-squared: 0.8489
## F-statistic: 690.2 on 3 and 365 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 21.851, df = 10, p-value = 0.01588
##
## Call:
## lm(formula = sold_count ~ lag1 + big_outlier + small_outlier +
## basket_count, data = sold)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9457 -0.3268 -0.3268 -0.3268 15.8896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.084918 0.741831 29.771 < 2e-16 ***
## lag1 0.201067 0.051769 3.884 0.000122 ***
## big_outlier 8.280673 0.765271 10.821 < 2e-16 ***
## small_outlier -13.436279 0.574476 -23.389 < 2e-16 ***
## basket_count 0.064946 0.004162 15.603 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.078 on 364 degrees of freedom
## Multiple R-squared: 0.8561, Adjusted R-squared: 0.8545
## F-statistic: 541.4 on 4 and 364 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 10.558, df = 10, p-value = 0.3929
The second type of model to be built is an ARIMA model, for which the data first needs to be decomposed. A frequency value must be chosen; since there is no significant seasonality, the highest value in the ACF, 9, is used. An additive decomposition is applied for this task. The random series can be seen below.
After the decomposition, the (p,d,q) orders should be chosen by examining the ACF and PACF. Looking at the ACF, q = 3 can be chosen; looking at the PACF, p = 3 or 6. The auto.arima function is used as well. The AIC and BIC values of the candidate models can be seen below; by these criteria, the (6,0,3) model is the best among them. After the order is selected, the regressors most correlated with the sold count are added to improve the model. The final model has lower AIC and BIC values, so we can proceed with it.
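The candidate-order comparison can be sketched as below, fitting each order on a simulated detrended series and printing AIC/BIC (numbers will differ from the report's output; tryCatch guards against non-convergent fits):

```r
set.seed(3)
# Simulated stationary stand-in for the detrended random component
detrend <- arima.sim(list(ar = 0.5, ma = 0.3), n = 300)

orders <- list(c(3, 0, 3), c(6, 0, 3))   # the two candidate orders
scores <- sapply(orders, function(ord) {
  fit <- tryCatch(arima(detrend, order = ord), error = function(e) NULL)
  if (is.null(fit)) c(AIC = NA, BIC = NA)
  else              c(AIC = AIC(fit), BIC = BIC(fit))
})
colnames(scores) <- c("(3,0,3)", "(6,0,3)")
scores  # lower AIC/BIC indicates the preferred order
```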
##
## Call:
## arima(x = detrend, order = c(3, 0, 3))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 intercept
## 0.8921 -0.0218 -0.3943 -1.1329 -0.1835 0.3164 -0.0022
## s.e. 0.5828 0.8229 0.4372 0.5875 0.9698 0.3877 0.0030
##
## sigma^2 estimated as 31.97: log likelihood = -1141.24, aic = 2298.49
## [1] 2298.488
## [1] 2329.599
##
## Call:
## arima(x = detrend, order = c(6, 0, 3))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ma1 ma2
## 0.3835 0.1168 -0.3827 -0.0999 -0.0047 -0.2076 -0.6080 -0.4489
## s.e. 0.2537 0.2274 0.1682 0.0954 0.0762 0.0602 0.2604 0.2583
## ma3 intercept
## 0.0569 -0.0022
## s.e. 0.2284 0.0032
##
## sigma^2 estimated as 31.25: log likelihood = -1137.14, aic = 2296.28
## [1] 2296.28
## [1] 2339.058
## Series: detrend
## ARIMA(0,0,1) with non-zero mean
##
## Coefficients:
## ma1 mean
## 0.2197 -0.0189
## s.e. 0.0486 0.4641
##
## sigma^2 estimated as 52.61: log likelihood=-1226.57
## AIC=2459.15 AICc=2459.22 BIC=2470.81
## [1] 2459.148
## [1] 2470.814
##
## Call:
## arima(x = detrend, order = c(6, 0, 3), xreg = xreg)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ma1 ma2
## 0.6479 0.2397 -0.6091 0.0157 0.0723 -0.1585 -0.8846 -0.5082
## s.e. 0.2398 0.2883 0.2012 0.0882 0.0822 0.0806 0.2410 0.3167
## ma3 intercept xreg
## 0.4353 -0.3086 0.0018
## s.e. 0.2588 0.1347 0.0008
##
## sigma^2 estimated as 30.83: log likelihood = -1133.08, aic = 2290.16
## [1] 2290.163
## [1] 2336.829
Two models were selected for prediction; their accuracy statistics can be seen here. According to the box plot, the weighted mean absolute percentage error of the ARIMA model is higher, so the linear model should be chosen, since a lower WMAPE is a sign of a better model.
## variable n mean sd CV FBias MAPE RMSE
## 1: lm_prediction 14 50.71429 11.75015 0.231693 0.07376177 0.1491078 10.37096
## 2: selected_arima 14 50.71429 11.75015 0.231693 0.05527895 0.2515232 15.59437
## MAD MADP WMAPE
## 1: 8.006217 0.1578691 0.1578691
## 2: 12.673371 0.2498975 0.2498975
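The WMAPE column in the table above follows a simple definition: the sum of absolute errors divided by the sum of actuals. A minimal helper, illustrated on toy vectors rather than the report's test set:

```r
# Weighted MAPE: sum of absolute errors over sum of actual values
wmape <- function(actual, pred) sum(abs(actual - pred)) / sum(actual)

actual <- c(50, 60, 40, 55)
pred   <- c(48, 63, 42, 50)
wmape(actual, pred)  # 12/205 = 0.05853659
```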
To conclude, below is a plot of the actual test set together with the predicted values of the chosen model. As can be seen, the predictions are pretty accurate.
First of all, the general behaviour of the data over time is examined with a time plot.
Secondly, the distributions across days and months are plotted to see whether sales change depending on the month and the day.
Finally, the relationship with previous observations is examined through the ACF and PACF graphs.
It can be said that there is a trend in the data, and once the trend factor is excluded, the autocorrelations at lag 1, lag 3, and lag 7 are significant.
The boxplots show that the data depend on both the month and the day factor. Since the day factor is significant, it will be used in model construction instead of lag7, and the frequency of the data is set to 7.
Some attributes of the data are not reliable, so they are examined through the data summary.
## price event_date product_content_id sold_count
## Min. :110.1 Min. :2020-05-25 Length:405 Min. : 0.00
## 1st Qu.:129.9 1st Qu.:2020-09-03 Class :character 1st Qu.: 20.00
## Median :136.3 Median :2020-12-13 Mode :character Median : 57.00
## Mean :135.3 Mean :2020-12-13 Mean : 94.91
## 3rd Qu.:141.6 3rd Qu.:2021-03-24 3rd Qu.:139.00
## Max. :165.9 Max. :2021-07-03 Max. :513.00
## NA's :9
## visit_count favored_count basket_count category_sold
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 321
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 92.0 1st Qu.: 610
## Median : 0 Median : 175 Median : 240.0 Median : 802
## Mean : 2267 Mean : 356 Mean : 399.2 Mean :1008
## 3rd Qu.: 4265 3rd Qu.: 588 3rd Qu.: 578.0 3rd Qu.:1099
## Max. :15725 Max. :2696 Max. :2249.0 Max. :5557
##
## category_brand_sold category_visits ty_visits category_basket
## Min. : 0 Min. : 346 Min. : 1 Min. : 0
## 1st Qu.: 0 1st Qu.: 657 1st Qu.: 1 1st Qu.: 0
## Median : 693 Median : 880 Median : 1 Median : 0
## Mean : 2991 Mean : 3896 Mean : 44737307 Mean : 18591
## 3rd Qu.: 5354 3rd Qu.: 1349 3rd Qu.:102143446 3rd Qu.: 41265
## Max. :28944 Max. :59310 Max. :178545693 Max. :281022
##
## category_favored w_day mon is_campaign
## Min. : 1242 Min. :1.000 Min. : 1.000 Min. :0.00000
## 1st Qu.: 2476 1st Qu.:2.000 1st Qu.: 4.000 1st Qu.:0.00000
## Median : 3298 Median :4.000 Median : 6.000 Median :0.00000
## Mean : 4202 Mean :4.007 Mean : 6.464 Mean :0.08642
## 3rd Qu.: 4869 3rd Qu.:6.000 3rd Qu.: 9.000 3rd Qu.:0.00000
## Max. :44445 Max. :7.000 Max. :12.000 Max. :1.00000
##
## price sold_count visit_count favored_count basket_count category_sold
## [1,] 112.9000 0 0 0 0 321
## [2,] 129.9000 20 0 0 92 610
## [3,] 136.2828 57 0 175 240 802
## [4,] 141.6109 139 4265 588 578 1099
## [5,] 158.1300 315 10646 1465 1287 1799
## category_brand_sold category_visits ty_visits category_basket
## [1,] 0 346 1 0
## [2,] 0 657 1 0
## [3,] 693 880 1 0
## [4,] 5354 1349 102143446 41265
## [5,] 12868 2348 178545693 95301
## category_favored w_day
## [1,] 1242 1
## [2,] 2476 2
## [3,] 3298 4
## [4,] 4869 6
## [5,] 8278 7
The relationship between the attributes and the response variable is examined with a correlation graph.
basket_count, category_visits, and category_favored have high correlations and appear reliable in the data summary. However, they contain zero values that are not expected in real life, so the zeros are replaced with the mean.
ty_visits is also constant at 1 before a particular date, and those values are replaced with the mean of ty_visits.
Some price values are NA; they are replaced with the mean price, since the price does not change significantly over time.
In the end, price, visit_count, basket_count, category_favored, ty_visits, and is_campaign are chosen as regressors.
The predictions are based on the previous observations' attributes, since the real attribute values are not available at prediction time.
The data does not have constant variance; therefore, besides the simple linear model, sqrt and Box-Cox transformations are used for the regression model.
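The two variance-stabilising transformations can be sketched as follows. In the report the Box-Cox lambda would be estimated (e.g. via MASS::boxcox); here a fixed, purely illustrative lambda = 0.3 is assumed instead:

```r
y <- c(5, 20, 57, 139, 300, 94)   # toy sold_count values

y_sqrt <- sqrt(y)                  # square-root transformation

lambda   <- 0.3                    # assumed, not estimated from the data
y_boxcox <- (y^lambda - 1) / lambda  # Box-Cox transform for lambda != 0

round(y_sqrt, 2); round(y_boxcox, 2)
```

Both transforms compress large values more than small ones, which damps the growing error variance seen in the untransformed model.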
Simple linear regression with no transformation
After many iterations, the day factor turns out not to be significant, although it was expected to be.
##
## Call:
## lm(formula = sold_count ~ price + visit_count + basket_count +
## category_basket + factor(mon) + factor(is_campaign) + trend +
## lag1 + lag3, data = train7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.274 -9.266 -0.198 7.729 121.941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.364e+01 3.324e+01 1.914 0.056364 .
## price -7.582e-01 2.381e-01 -3.184 0.001576 **
## visit_count -1.032e-02 1.641e-03 -6.287 9.13e-10 ***
## basket_count 2.258e-01 1.012e-02 22.312 < 2e-16 ***
## category_basket 2.713e-04 8.129e-05 3.338 0.000931 ***
## factor(mon)2 -8.347e+00 8.170e+00 -1.022 0.307592
## factor(mon)3 -1.863e+01 7.361e+00 -2.530 0.011807 *
## factor(mon)4 -1.232e+01 8.375e+00 -1.471 0.142073
## factor(mon)5 2.585e+01 8.131e+00 3.179 0.001602 **
## factor(mon)6 2.282e+01 6.630e+00 3.442 0.000643 ***
## factor(mon)7 2.496e+01 7.462e+00 3.344 0.000909 ***
## factor(mon)8 1.643e+01 7.158e+00 2.296 0.022238 *
## factor(mon)9 -1.057e+00 7.680e+00 -0.138 0.890645
## factor(mon)10 -5.613e-01 6.755e+00 -0.083 0.933815
## factor(mon)11 4.073e+00 6.533e+00 0.623 0.533369
## factor(mon)12 -3.015e+00 5.780e+00 -0.522 0.602269
## factor(is_campaign)1 7.251e-01 4.654e+00 0.156 0.876270
## trend 1.782e-01 2.499e-02 7.130 5.32e-12 ***
## lag1 1.754e-01 2.743e-02 6.395 4.84e-10 ***
## lag3 4.160e-02 2.170e-02 1.917 0.056001 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.41 on 370 degrees of freedom
## Multiple R-squared: 0.9504, Adjusted R-squared: 0.9478
## F-statistic: 372.9 on 19 and 370 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 23
##
## data: Residuals
## LM test = 58.281, df = 23, p-value = 6.744e-05
The residual analysis for the lm model is good: the residuals show no significant autocorrelation around a zero mean, although the error variability is higher for larger fitted values.
Simple linear regression with sqrt() transformation
The same iterative variable selection is applied to the sqrt-transformed response; the regressors kept are shown below.
##
## Call:
## lm(formula = sqrt ~ price + visit_count + basket_count + ty_visits +
## factor(mon) + lag1 + factor(is_campaign) + category_visits +
## category_basket, data = train7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0045 -0.6253 0.0536 0.6764 3.7677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.418e+01 1.818e+00 7.804 6.22e-14 ***
## price -7.325e-02 1.230e-02 -5.956 6.03e-09 ***
## visit_count -9.906e-04 9.360e-05 -10.583 < 2e-16 ***
## basket_count 1.133e-02 5.471e-04 20.703 < 2e-16 ***
## ty_visits 4.269e-08 4.811e-09 8.875 < 2e-16 ***
## factor(mon)2 -3.187e-01 4.807e-01 -0.663 0.507661
## factor(mon)3 -4.962e-01 4.251e-01 -1.167 0.243875
## factor(mon)4 -5.304e-01 4.786e-01 -1.108 0.268413
## factor(mon)5 -7.138e-01 4.597e-01 -1.553 0.121313
## factor(mon)6 -1.599e+00 3.481e-01 -4.594 5.98e-06 ***
## factor(mon)7 -1.911e+00 3.513e-01 -5.440 9.70e-08 ***
## factor(mon)8 -1.667e+00 3.690e-01 -4.517 8.44e-06 ***
## factor(mon)9 -1.600e+00 4.209e-01 -3.801 0.000169 ***
## factor(mon)10 -1.698e+00 3.656e-01 -4.645 4.74e-06 ***
## factor(mon)11 -1.202e+00 3.547e-01 -3.388 0.000779 ***
## factor(mon)12 -3.743e-01 3.134e-01 -1.194 0.233053
## lag1 1.217e-02 1.313e-03 9.275 < 2e-16 ***
## factor(is_campaign)1 -2.161e-01 2.537e-01 -0.852 0.395006
## category_visits 4.162e-05 1.234e-05 3.371 0.000827 ***
## category_basket -1.017e-05 4.776e-06 -2.129 0.033882 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.217 on 370 degrees of freedom
## Multiple R-squared: 0.9399, Adjusted R-squared: 0.9368
## F-statistic: 304.7 on 19 and 370 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 23
##
## data: Residuals
## LM test = 95.006, df = 23, p-value = 1.012e-10
The residual analysis shows significant autocorrelation at lag 1 around a zero mean, and the error variability is again higher for larger values. This model is poorer than the lm model with no transformation.
Simple linear regression with BoxCox transformation
After many iterations, the day factor and the lag3 factor are not significant for the Box-Cox linear model while lag7 is, so the insignificant ones are excluded; category_basket, on the other hand, is significant for the Box-Cox transformation.
##
## Call:
## lm(formula = BoxCox ~ price + visit_count + basket_count + category_favored +
## ty_visits + factor(mon) + lag1 + lag7 + factor(is_campaign) +
## category_basket, data = train7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7760 -0.3972 0.1812 0.6479 2.3175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.278e+01 2.177e+00 5.872 9.63e-09 ***
## price -6.817e-02 1.482e-02 -4.599 5.85e-06 ***
## visit_count -5.974e-04 1.195e-04 -4.999 8.92e-07 ***
## basket_count 5.067e-03 7.149e-04 7.087 7.01e-12 ***
## category_favored 1.369e-04 4.166e-05 3.286 0.00111 **
## ty_visits 4.048e-08 4.666e-09 8.677 < 2e-16 ***
## factor(mon)2 2.373e-01 5.810e-01 0.409 0.68311
## factor(mon)3 -5.853e-02 5.009e-01 -0.117 0.90704
## factor(mon)4 -7.257e-01 5.495e-01 -1.321 0.18744
## factor(mon)5 -1.649e+00 5.416e-01 -3.045 0.00249 **
## factor(mon)6 -2.041e+00 4.065e-01 -5.021 8.01e-07 ***
## factor(mon)7 -2.185e+00 4.234e-01 -5.160 4.04e-07 ***
## factor(mon)8 -1.384e+00 4.420e-01 -3.132 0.00187 **
## factor(mon)9 -1.402e+00 5.038e-01 -2.783 0.00566 **
## factor(mon)10 -1.334e+00 4.359e-01 -3.060 0.00237 **
## factor(mon)11 -1.229e+00 4.308e-01 -2.852 0.00459 **
## factor(mon)12 -2.598e-01 3.750e-01 -0.693 0.48882
## lag1 6.648e-03 1.573e-03 4.227 2.98e-05 ***
## lag7 2.820e-03 1.239e-03 2.277 0.02338 *
## factor(is_campaign)1 -3.815e-01 3.167e-01 -1.205 0.22912
## category_basket -3.401e-05 7.166e-06 -4.746 2.97e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.452 on 369 degrees of freedom
## Multiple R-squared: 0.7651, Adjusted R-squared: 0.7523
## F-statistic: 60.08 on 20 and 369 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 24
##
## data: Residuals
## LM test = 197.94, df = 24, p-value < 2.2e-16
By residual analysis, the Box-Cox model has large deviations over time, and its adjusted R-squared value is lower than the others'.
ARIMA Models
When the ARIMA models are constructed, the auto.arima function is used and re-run every day. Seasonality is set to TRUE, and the frequency is determined as 7 by observing the ACF and PACF graphs.
An additive model, a multiplicative model, and a linear regression model are used for decomposition to obtain stationary data.
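The additive and multiplicative decompositions with frequency 7 can be sketched with base R's decompose(); the KPSS tests reported below would then be run on the remainder with a package such as urca (not shown). Data here are simulated:

```r
set.seed(4)
# Simulated daily series: level + trend + weekly pattern + noise
y <- ts(50 + 0.2 * (1:140) + rep(c(5, -3, 0, 2, -4, 6, -6), 20) + rnorm(140, 0, 2),
        frequency = 7)

dec_add  <- decompose(y, type = "additive")
dec_mult <- decompose(y, type = "multiplicative")

# The (hopefully stationary) remainder is what the ARIMA models are fit to
head(na.omit(dec_add$random))
```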
## [1] "The Additive Model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0069
## [1] "The Multiplicative Model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.2127
## [1] "Linear Regression"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0244
The multiplicative model's test result is not satisfactory; therefore, I will use the additive decomposition for the ARIMA and ARIMA-with-regressors models.
The linear regression model's residuals are stationary, so they are modelled with an ARIMA model and the two parts are combined at the end.
The regressors mentioned above are used for the ARIMA model with regressors.
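A hedged sketch of the "linear regression plus ARIMA on residuals" strategy, on simulated data: fit lm() for the systematic part, model its residuals with arima(), and add the two one-step forecasts back together:

```r
set.seed(5)
n <- 200
x <- rnorm(n, 100, 20)                                   # stand-in regressor
y <- 10 + 0.5 * x + arima.sim(list(ma = c(0.2, 0.15, 0.1)), n = n)

lm_fit  <- lm(y ~ x)                                     # systematic part
res_fit <- arima(residuals(lm_fit), order = c(0, 0, 3),  # residual dynamics
                 include.mean = FALSE)

new_x    <- 110                                          # next day's regressor value
lm_part  <- predict(lm_fit, newdata = data.frame(x = new_x))
res_part <- predict(res_fit, n.ahead = 1)$pred
as.numeric(lm_part + res_part)                           # combined forecast
```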
ARIMA Model
## Series: decomposed$random
## ARIMA(0,0,1)(0,0,2)[7] with non-zero mean
##
## Coefficients:
## ma1 sma1 sma2 mean
## 0.3325 0.0897 -0.0993 0.2025
## s.e. 0.0468 0.0526 0.0522 2.2903
##
## sigma^2 estimated as 1165: log likelihood=-1898.69
## AIC=3807.38 AICc=3807.54 BIC=3827.13
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,0,2)[7] with non-zero mean
## Q* = 52.251, df = 10, p-value = 1.025e-07
##
## Model df: 4. Total lags used: 14
Observing the plots, the PACF is significant at lag 1 and the ACF drops off after lag 1, so it is reasonable that auto.arima gives an MA(1). The PACF and ACF are also significant at seasonal lag 2, so the seasonal order (0,0,2) is reasonable, too.
## Series: decomposed$random
## Regression with ARIMA(5,1,1) errors
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 xreg
## 0.174 -0.3550 -0.2993 -0.0638 -0.2540 -0.9827 -0.4743
## s.e. 0.050 0.0506 0.0515 0.0506 0.0505 0.0133 0.2082
##
## sigma^2 estimated as 940.2: log likelihood=-1853.64
## AIC=3723.27 AICc=3723.66 BIC=3754.85
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(5,1,1) errors
## Q* = 19.569, df = 7, p-value = 0.00658
##
## Model df: 7. Total lags used: 14
## [1] 3723.27
By residual analysis, the ARIMA with regressors has no autocorrelated residuals and a lower AIC; therefore, it is a better model than the plain ARIMA.
ARIMA combined with linear regression
## Series: residuals
## ARIMA(0,0,3) with zero mean
##
## Coefficients:
## ma1 ma2 ma3
## 0.1687 0.1556 0.0900
## s.e. 0.0504 0.0510 0.0537
##
## sigma^2 estimated as 454: log likelihood=-1744.93
## AIC=3497.86 AICc=3497.97 BIC=3513.73
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,3) with zero mean
## Q* = 3.6185, df = 7, p-value = 0.8225
##
## Model df: 3. Total lags used: 10
The auto.arima model on the residuals gives a zero mean, no autocorrelated residuals, and a lower AIC value; by residual analysis it is better than both the ARIMA and ARIMA-with-regressors models.
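A minimal sketch of the "linear regression plus ARIMA on residuals" combination described above; the data are simulated and the predictor name is illustrative:

```r
# The final forecast is the regression prediction plus the ARIMA forecast of
# the regression residuals; simulated data, illustrative only.
library(forecast)

set.seed(1)
n  <- 120
df <- data.frame(basket_count = rpois(n, 20))
df$sold_count <- 2 + 0.5 * df$basket_count +
  as.numeric(arima.sim(list(ma = 0.4), n = n))

fit_lm    <- lm(sold_count ~ basket_count, data = df)
fit_arima <- auto.arima(residuals(fit_lm))  # models what the lm left behind

h        <- 14
newdata  <- data.frame(basket_count = rpois(h, 20))
combined <- predict(fit_lm, newdata = newdata) +
  as.numeric(forecast(fit_arima, h = h)$mean)
```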
The predictions are based on the last available attributes, and they are plotted together with the actual sales values.
## event_date actual sqrt_forecasted_sold BoxCox_forecasted_sold
## 1: 2021-06-19 104 85.19013 46.48724
## 2: 2021-06-20 149 142.34921 105.17396
## 3: 2021-06-21 128 116.80927 111.63219
## 4: 2021-06-22 56 97.96816 82.82714
## 5: 2021-06-23 59 65.41235 53.70646
## 6: 2021-06-24 56 63.05931 51.09164
## 7: 2021-06-25 36 55.53599 42.38736
## 8: 2021-06-26 40 52.72843 38.58299
## 9: 2021-06-27 46 72.90855 71.13411
## 10: 2021-06-28 64 73.59749 64.87460
## 11: 2021-06-29 137 120.37899 111.99730
## 12: 2021-06-30 131 133.14419 127.36637
## 13: 2021-07-01 130 106.68538 86.68351
## 14: 2021-07-02 108 97.23984 79.91181
## lm_forecasted_sold forecasted_lm7_arima add_arima_forecasted
## 1: 122.10110 116.96870 151.61943
## 2: 156.84887 152.07278 145.17374
## 3: 130.65940 126.83047 158.80660
## 4: 108.49373 107.28933 154.81502
## 5: 82.48148 74.54122 134.35688
## 6: 77.32606 67.77444 120.78775
## 7: 69.69560 61.49401 97.99568
## 8: 70.19297 63.37929 74.91833
## 9: 67.03113 58.76189 57.04774
## 10: 79.05741 71.53800 52.04136
## 11: 131.06600 126.11470 55.21875
## 12: 140.90974 140.62073 74.90336
## 13: 139.55899 139.03424 87.46261
## 14: 124.65919 127.33803 86.98584
## reg_add_arima_forecasted
## 1: 152.31646
## 2: 145.80627
## 3: 176.87708
## 4: 175.92230
## 5: 140.35727
## 6: 109.98016
## 7: 94.61315
## 8: 79.87082
## 9: 65.01756
## 10: 56.92829
## 11: 51.75949
## 12: 68.10496
## 13: 84.87163
## 14: 87.40229
## model n mean sd CV FBias
## 1: sqrt_forecasted_sold 14 88.85714 41.34072 0.4652492 -0.03135634
## 2: BoxCox_forecasted_sold 14 88.85714 41.34072 0.4652492 0.13677116
## 3: lm_forecasted_sold 14 88.85714 41.34072 0.4652492 -0.20585343
## 4: forecasted_lm7_arima 14 88.85714 41.34072 0.4652492 -0.15253843
## 5: add_arima_forecasted 14 88.85714 41.34072 0.4652492 -0.16730956
## 6: reg_add_arima_forecasted 14 88.85714 41.34072 0.4652492 -0.19761071
## MAPE RMSE MAD MADP WMAPE
## 1: 0.2363980 18.27491 15.26440 0.1717859 0.1717859
## 2: 0.2291337 27.05425 20.61355 0.2319853 0.2319853
## 3: 0.3352665 22.95409 19.13926 0.2153936 0.2153936
## 4: 0.2595226 19.40657 15.27625 0.1719192 0.1719192
## 5: 0.6779985 53.64278 45.89727 0.5165288 0.5165288
## 6: 0.7243721 58.41190 49.57728 0.5579436 0.5579436
Since the ARIMA model combined with linear regression has the lowest WMAPE value, it is selected for prediction. In addition, each day the error rates are recalculated over the last 14 days, and the model whose predictions have the lowest WMAPE is selected.
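The daily selection rule can be sketched as follows; the WMAPE definition here matches the error table above, and the numbers are illustrative:

```r
# WMAPE = sum(|actual - forecast|) / sum(|actual|); each day, the model with
# the lowest WMAPE over the last 14 days is chosen. Numbers are illustrative.
wmape <- function(actual, forecast) sum(abs(actual - forecast)) / sum(abs(actual))

actual    <- c(104, 149, 128, 56, 59)
forecasts <- list(
  sqrt_lm   = c(85, 142, 117, 98, 65),
  add_arima = c(152, 145, 159, 155, 134)
)
errors <- sapply(forecasts, wmape, actual = actual)
best   <- names(which.min(errors))  # "sqrt_lm"
```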
## add_arima xreg_add_arima forecast_lm forecast_lm_arima
## 85.54331 89.25217 118.33530 120.30736
## BoxCox_lm Sqrt_lm
## 78.85804 94.00000
It can be seen that the sales are zero most of the time; however, there is a huge increase in October.
The ACF and PACF of the data show significant autocorrelation at lag 1 and lag 7.
The correlations of price, visit_count, and basket_count are high, and it is expected that these variables can be zero when sold_count is zero.
However, it is not expected that category_favored and Trendyol visits are zero or one; therefore, these values are replaced with the mean.
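A minimal sketch of this replacement; the helper function and values are illustrative, not the exact code used in the report:

```r
# Replace implausible 0/1 entries with the mean of the remaining values.
impute_mean <- function(x, bad = c(0, 1)) {
  x[x %in% bad] <- mean(x[!(x %in% bad)])
  x
}

category_favored <- c(0, 1, 24534, 8618, 50341)
impute_mean(category_favored)  # the 0 and 1 both become 27831
```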
## price event_date product_content_id sold_count
## Min. : -1.0 Min. :2020-05-25 Length:405 Min. : 0.0000
## 1st Qu.:350.0 1st Qu.:2020-09-03 Class :character 1st Qu.: 0.0000
## Median :600.0 Median :2020-12-13 Mode :character Median : 0.0000
## Mean :559.3 Mean :2020-12-13 Mean : 0.9284
## 3rd Qu.:734.3 3rd Qu.:2021-03-24 3rd Qu.: 0.0000
## Max. :833.3 Max. :2021-07-03 Max. :52.0000
## NA's :303
## visit_count favored_count basket_count category_sold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 16.0
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 45.0
## Mean : 27.24 Mean : 2.242 Mean : 5.83 Mean : 200.2
## 3rd Qu.: 3.00 3rd Qu.: 2.000 3rd Qu.: 5.00 3rd Qu.: 111.0
## Max. :516.00 Max. :37.000 Max. :247.00 Max. :3299.0
##
## category_brand_sold category_visits ty_visits category_basket
## Min. : 0 Min. : 367 Min. : 1 Min. : 0
## 1st Qu.: 0 1st Qu.: 1432 1st Qu.: 1 1st Qu.: 0
## Median : 6 Median : 5324 Median : 1 Median : 0
## Mean : 46247 Mean : 27767 Mean : 44737307 Mean : 353021
## 3rd Qu.: 94562 3rd Qu.: 9538 3rd Qu.:102143446 3rd Qu.: 464380
## Max. :259590 Max. :583672 Max. :178545693 Max. :3102147
##
## category_favored w_day mon is_campaign
## Min. : 2324 Min. :1.000 Min. : 1.000 Min. :0.00000
## 1st Qu.: 8618 1st Qu.:2.000 1st Qu.: 4.000 1st Qu.:0.00000
## Median : 24534 Median :4.000 Median : 6.000 Median :0.00000
## Mean : 33688 Mean :4.007 Mean : 6.464 Mean :0.08642
## 3rd Qu.: 50341 3rd Qu.:6.000 3rd Qu.: 9.000 3rd Qu.:0.00000
## Max. :244883 Max. :7.000 Max. :12.000 Max. :1.00000
##
## price sold_count visit_count favored_count basket_count category_sold
## [1,] -1.00 0 0 0 0 0
## [2,] 349.99 0 0 0 0 16
## [3,] 599.98 0 0 0 0 45
## [4,] 736.64 0 3 2 5 111
## [5,] 833.32 0 7 5 12 248
## category_brand_sold category_visits ty_visits category_basket
## [1,] 0 367 1 0
## [2,] 0 1432 1 0
## [3,] 6 5324 1 0
## [4,] 94562 9538 102143446 464380
## [5,] 235840 21187 178545693 1158593
## category_favored w_day
## [1,] 2324 1
## [2,] 8618 2
## [3,] 24534 4
## [4,] 50341 6
## [5,] 111346 7
Considering correlation and variable reliability, "price", "visit_count", "basket_count", and "category_favored" are selected as regressors.
The ACF and PACF graphs show high correlation at lag 1, lag 2, lag 5, and lag 7; therefore, these lags are added as attributes.
Since the jacket is an expensive product, consumers are expected to consider its previous price. Therefore, previous prices of the jacket are examined.
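These lagged attributes can be created as in the sketch below, assuming data.table::shift; the toy values are illustrative:

```r
# Add lagged sales and lagged price columns; shift() pads the start with NA.
library(data.table)

dt <- data.table(sold_count = c(3, 5, 2, 7, 4, 6, 8, 1),
                 price      = c(700, 700, 720, 720, 690, 690, 700, 700))
dt[, `:=`(lag1        = shift(sold_count, 1),
          lag2        = shift(sold_count, 2),
          price_lag_4 = shift(price, 4))]
```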
The data will be predicted from the previous observations' attributes, since the real attributes are not available at prediction time.
The data does not have constant variance; therefore, besides the simple linear model, sqrt and BoxCox transformations are used for the regression model.
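The transformed-response models can be sketched as below, assuming forecast::BoxCox and InvBoxCox; the series is illustrative, and the +1 shift (so BoxCox handles zeros) is an assumption, not necessarily the report's choice:

```r
# Transform the response, fit on the transformed scale, then back-transform
# the predictions; simulated counts, illustrative only.
library(forecast)

y      <- c(0, 2, 5, 1, 0, 8, 3, 12, 4, 6) + 1  # shift so BoxCox is defined at zero
lambda <- BoxCox.lambda(y)

sqrt_y   <- sqrt(y)            # response for the sqrt model
boxcox_y <- BoxCox(y, lambda)  # response for the BoxCox model

# After predicting on the transformed scale:
#   sqrt model:   pred^2
#   BoxCox model: InvBoxCox(pred, lambda)
```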
Simple Regression
After many iterations, it is seen that the most significant variables are price, visit_count, basket_count, category_favored, factor(w_day), factor(mon), lag1, lag2, and price_lag_4.
##
## Call:
## lm(formula = sold_count ~ price + visit_count + basket_count +
## category_favored + factor(w_day) + factor(mon) + lag1 + lag2 +
## price_lag_4, data = train8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7381 -0.2841 -0.0468 0.2941 6.6801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.330e+00 3.547e-01 3.750 0.000205 ***
## price 1.418e-03 4.286e-04 3.309 0.001029 **
## visit_count 1.290e-03 1.189e-03 1.085 0.278473
## basket_count 1.875e-01 4.545e-03 41.246 < 2e-16 ***
## category_favored -2.489e-05 3.524e-06 -7.062 8.33e-12 ***
## factor(w_day)2 4.478e-01 2.281e-01 1.963 0.050385 .
## factor(w_day)3 3.275e-01 2.294e-01 1.428 0.154278
## factor(w_day)4 5.850e-01 2.294e-01 2.550 0.011173 *
## factor(w_day)5 4.696e-01 2.311e-01 2.032 0.042920 *
## factor(w_day)6 3.276e-01 2.303e-01 1.422 0.155776
## factor(w_day)7 1.595e-01 2.291e-01 0.696 0.486652
## factor(mon)2 -5.766e-02 3.193e-01 -0.181 0.856773
## factor(mon)3 -4.079e-01 3.128e-01 -1.304 0.193058
## factor(mon)4 -7.966e-01 3.313e-01 -2.404 0.016694 *
## factor(mon)5 -1.088e+00 3.607e-01 -3.016 0.002739 **
## factor(mon)6 -1.503e+00 3.585e-01 -4.192 3.47e-05 ***
## factor(mon)7 -1.489e+00 3.760e-01 -3.960 9.01e-05 ***
## factor(mon)8 -1.373e+00 3.684e-01 -3.727 0.000224 ***
## factor(mon)9 -1.250e+00 3.570e-01 -3.502 0.000520 ***
## factor(mon)10 6.976e-01 4.587e-01 1.521 0.129181
## factor(mon)11 -1.549e+00 3.584e-01 -4.323 1.99e-05 ***
## factor(mon)12 -9.394e-02 3.089e-01 -0.304 0.761183
## lag1 -5.277e-05 2.168e-02 -0.002 0.998059
## lag2 -7.379e-02 2.135e-02 -3.456 0.000612 ***
## price_lag_4 -2.195e-03 3.865e-04 -5.679 2.76e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 365 degrees of freedom
## Multiple R-squared: 0.8894, Adjusted R-squared: 0.8821
## F-statistic: 122.3 on 24 and 365 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 28
##
## data: Residuals
## LM test = 150.89, df = 28, p-value < 2.2e-16
Simple Linear Regression with sqrt() transformation
After many iterations, it is seen that price_lag_4 and lag2 are not significant for the sqrt transformation model, while lag5 is significant.
##
## Call:
## lm(formula = sqrt ~ price + visit_count + basket_count + category_favored +
## factor(w_day) + factor(mon) + lag1 + lag5, data = train8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14135 -0.06833 0.00032 0.05779 1.38530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.452e-01 8.630e-02 -1.683 0.093257 .
## price 1.862e-03 1.049e-04 17.759 < 2e-16 ***
## visit_count 1.884e-03 2.897e-04 6.502 2.60e-10 ***
## basket_count 2.595e-02 1.117e-03 23.222 < 2e-16 ***
## category_favored -2.433e-06 8.906e-07 -2.732 0.006599 **
## factor(w_day)2 1.256e-01 5.663e-02 2.217 0.027225 *
## factor(w_day)3 7.640e-02 5.668e-02 1.348 0.178490
## factor(w_day)4 9.278e-02 5.665e-02 1.638 0.102333
## factor(w_day)5 4.810e-02 5.707e-02 0.843 0.399890
## factor(w_day)6 5.388e-02 5.672e-02 0.950 0.342781
## factor(w_day)7 1.172e-02 5.663e-02 0.207 0.836212
## factor(mon)2 -1.259e-01 7.890e-02 -1.596 0.111264
## factor(mon)3 -6.078e-02 7.733e-02 -0.786 0.432396
## factor(mon)4 -9.304e-02 8.204e-02 -1.134 0.257475
## factor(mon)5 5.873e-02 8.940e-02 0.657 0.511630
## factor(mon)6 -1.577e-01 8.895e-02 -1.774 0.076969 .
## factor(mon)7 -1.579e-01 9.348e-02 -1.689 0.092049 .
## factor(mon)8 -1.472e-01 9.156e-02 -1.607 0.108821
## factor(mon)9 -1.489e-01 8.865e-02 -1.679 0.093921 .
## factor(mon)10 1.899e-02 1.055e-01 0.180 0.857272
## factor(mon)11 -2.973e-01 8.586e-02 -3.463 0.000598 ***
## factor(mon)12 -7.452e-02 7.598e-02 -0.981 0.327344
## lag1 2.248e-02 5.270e-03 4.266 2.54e-05 ***
## lag5 -8.956e-03 5.015e-03 -1.786 0.074978 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2966 on 366 degrees of freedom
## Multiple R-squared: 0.8919, Adjusted R-squared: 0.8851
## F-statistic: 131.3 on 23 and 366 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 27
##
## data: Residuals
## LM test = 138.68, df = 27, p-value < 2.2e-16
In the residual analysis there is no significant difference, and the adjusted R-squared value of the sqrt transformation is higher.
Simple Linear Regression Model with BoxCox Transformation
After many iterations, price, visit_count, basket_count, category_favored, factor(w_day), factor(mon), and lag1 are the most significant variables for the Simple Linear Regression Model with BoxCox Transformation.
##
## Call:
## lm(formula = BoxCox ~ price + visit_count + basket_count + category_favored +
## factor(w_day) + factor(mon) + lag1, data = train8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0463 -0.1968 -0.0363 0.1387 4.1884
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.686e+00 2.907e-01 -29.884 < 2e-16 ***
## price 1.198e-02 3.529e-04 33.939 < 2e-16 ***
## visit_count 9.627e-03 9.792e-04 9.831 < 2e-16 ***
## basket_count 2.921e-02 3.754e-03 7.781 7.38e-14 ***
## category_favored -8.752e-07 2.941e-06 -0.298 0.766192
## factor(w_day)2 2.736e-01 1.911e-01 1.432 0.152962
## factor(w_day)3 1.992e-01 1.920e-01 1.037 0.300208
## factor(w_day)4 1.375e-01 1.921e-01 0.716 0.474649
## factor(w_day)5 -1.911e-03 1.935e-01 -0.010 0.992125
## factor(w_day)6 1.953e-01 1.923e-01 1.016 0.310483
## factor(w_day)7 -3.231e-02 1.920e-01 -0.168 0.866440
## factor(mon)2 -6.849e-01 2.675e-01 -2.560 0.010856 *
## factor(mon)3 -2.233e-01 2.619e-01 -0.852 0.394519
## factor(mon)4 -2.158e-01 2.773e-01 -0.778 0.436990
## factor(mon)5 1.364e+00 3.013e-01 4.527 8.08e-06 ***
## factor(mon)6 -1.162e-01 2.995e-01 -0.388 0.698387
## factor(mon)7 -2.235e-01 3.144e-01 -0.711 0.477592
## factor(mon)8 -2.174e-01 3.080e-01 -0.706 0.480741
## factor(mon)9 -2.665e-01 2.986e-01 -0.892 0.372773
## factor(mon)10 -3.659e-01 3.567e-01 -1.026 0.305735
## factor(mon)11 -5.961e-01 2.829e-01 -2.107 0.035797 *
## factor(mon)12 -2.368e-01 2.576e-01 -0.919 0.358460
## lag1 6.100e-02 1.775e-02 3.436 0.000658 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.005 on 367 degrees of freedom
## Multiple R-squared: 0.918, Adjusted R-squared: 0.9131
## F-statistic: 186.8 on 22 and 367 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 26
##
## data: Residuals
## LM test = 114.48, df = 26, p-value = 4.496e-13
In the residual analysis and adjusted R-squared comparison, BoxCox is better than the others; however, it is very sensitive to the back-transformation, so its predictions may be poor.
Arima Models
When the ARIMA models are constructed, the auto.arima function is used, and it is rerun every day. Seasonality is set to TRUE, and the frequency is determined as seven by observing the ACF and PACF graphs.
Additive, multiplicative, and linear-regression-based decompositions are used to obtain stationary data.
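A minimal sketch of the decomposition and stationarity check, assuming urca::ur.kpss produces the KPSS output printed in this report; the series is simulated:

```r
# Decompose a weekly-frequency series additively, then KPSS-test the remainder;
# a small test statistic means we fail to reject stationarity.
library(urca)

set.seed(7)
y   <- ts(50 + 0.05 * (1:140) + rnorm(140, sd = 5), frequency = 7)
dec <- decompose(y, type = "additive")    # trend + seasonal + random
rnd <- na.omit(as.numeric(dec$random))    # remainder fed to the ARIMA models

summary(ur.kpss(rnd))
```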
## [1] "The Additive Model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0089
## [1] "The Multiplicative Model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.069
## [1] "Linear Regression"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0142
The multiplicative model is significant at the alpha = 0.10 level; therefore, I will use the additive decomposition for the ARIMA and ARIMA-with-regressors models.
The linear regression model residuals are stationary; therefore, an ARIMA model is fitted on these residuals and the two are combined at the end.
The regressors mentioned above are used for the ARIMA model with regressors.
Arima
## Series: decomposed$random
## ARIMA(5,0,0) with zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5
## -0.1935 -0.4373 -0.4021 -0.3274 -0.1575
## s.e. 0.0504 0.0485 0.0492 0.0483 0.0502
##
## sigma^2 estimated as 5.192: log likelihood=-859.14
## AIC=1730.28 AICc=1730.51 BIC=1753.99
##
## Ljung-Box test
##
## data: Residuals from ARIMA(5,0,0) with zero mean
## Q* = 54.422, df = 9, p-value = 1.569e-08
##
## Model df: 5. Total lags used: 14
Arima with Regressor
## Series: decomposed$random
## Regression with ARIMA(0,0,0)(0,0,2)[7] errors
##
## Coefficients:
## sma1 sma2 intercept xreg
## 0.1989 -0.1040 -0.3089 0.0013
## s.e. 0.0505 0.0501 0.2072 0.0006
##
## sigma^2 estimated as 6.874: log likelihood=-913.24
## AIC=1836.49 AICc=1836.65 BIC=1856.24
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(0,0,0)(0,0,2)[7] errors
## Q* = 100.13, df = 10, p-value < 2.2e-16
##
## Model df: 4. Total lags used: 14
Arima combined with linear Regression
## Series: residuals
## ARIMA(1,0,4) with zero mean
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4
## 0.7355 -0.7514 0.1745 -0.1317 -0.2326
## s.e. 0.0615 0.0740 0.0651 0.0736 0.0599
##
## sigma^2 estimated as 1.142: log likelihood=-577.29
## AIC=1166.59 AICc=1166.81 BIC=1190.39
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,4) with zero mean
## Q* = 8.9276, df = 5, p-value = 0.112
##
## Model df: 5. Total lags used: 10
All of the models, including mul_arima and reg_mul_arima, are used for prediction, since they are significant at alpha = 0.10.
## event_date actual sqrt_forecasted_sold BoxCox_forecasted_sold
## 1: 2021-06-19 0 0 0
## 2: 2021-06-20 1 2 3
## 3: 2021-06-21 2 2 3
## 4: 2021-06-22 2 1 0
## 5: 2021-06-23 2 1 0
## 6: 2021-06-24 2 1 0
## 7: 2021-06-25 2 1 0
## 8: 2021-06-26 1 0 0
## 9: 2021-06-27 0 0 0
## 10: 2021-06-28 4 1 0
## 11: 2021-06-29 1 3 6
## 12: 2021-06-30 0 0 0
## 13: 2021-07-01 1 1 1
## 14: 2021-07-02 2 2 2
## lm_forecasted_sold forecasted_lm8_arima add_arima_forecasted
## 1: -1 -1 2
## 2: 1 2 3
## 3: 1 2 3
## 4: 1 0 2
## 5: 1 1 2
## 6: 0 0 2
## 7: 0 0 1
## 8: 0 1 1
## 9: -1 0 1
## 10: 1 1 1
## 11: 2 2 2
## 12: -1 -1 2
## 13: 2 2 3
## 14: 1 1 1
## mul_arima_forecasted reg_add_arima_forecasted reg_mul_arima_forecasted
## 1: 2 2 0
## 2: 1 3 0
## 3: 2 3 5
## 4: 2 2 5
## 5: 2 2 3
## 6: 2 2 -1
## 7: 1 1 0
## 8: 1 1 1
## 9: 1 1 1
## 10: 1 1 1
## 11: 2 2 2
## 12: 2 2 6
## 13: 2 3 0
## 14: 1 1 4
## model n mean sd CV FBias MAPE RMSE
## 1: sqrt_forecasted_sold 14 1.428571 1.08941 0.7625867 0.25 NaN 1.164965
## 2: BoxCox_forecasted_sold 14 1.428571 1.08941 0.7625867 0.25 NaN 2.121320
## 3: lm_forecasted_sold 14 1.428571 1.08941 0.7625867 0.65 Inf 1.388730
## 4: forecasted_lm8_arima 14 1.428571 1.08941 0.7625867 0.50 NaN 1.414214
## 5: add_arima_forecasted 14 1.428571 1.08941 0.7625867 -0.30 Inf 1.463850
## 6: mul_arima_forecasted 14 1.428571 1.08941 0.7625867 -0.10 Inf 1.253566
## 7: reg_add_arima_forecasted 14 1.428571 1.08941 0.7625867 -0.30 Inf 1.463850
## 8: reg_mul_arima_forecasted 14 1.428571 1.08941 0.7625867 -0.35 NaN 2.464027
## MAD MADP WMAPE
## 1: 0.7857143 0.55 0.55
## 2: 1.5000000 1.05 1.05
## 3: 1.2142857 0.85 0.85
## 4: 1.1428571 0.80 0.80
## 5: 1.1428571 0.80 0.80
## 6: 0.8571429 0.60 0.60
## 7: 1.1428571 0.80 0.80
## 8: 1.9285714 1.35 1.35
The error rates are very high; however, the range of the response variable is very narrow, so this is expected: if the actual sales are 1 and the prediction is 2, the error rate is already 100%.
The mul_arima_forecasted model has the lowest error rate.
Each day, the error rates are recalculated over the last 14 days, and the model whose predictions have the lowest WMAPE is selected.
## add_arima mul_arima xreg_mul_arima xreg_add_arima
## 1.5234013 0.4952038 1.1665486 1.5962115
## forecast_lm forecast_lm_arima.1 BoxCox_lm Sqrt_lm
## 0.6147960 0.3585320 1.6999947 2.0000000
By observing the graph below, the month effect is clearly observable. This is expected, since bikinis are worn in the hot seasons in Turkey. Moreover, by examining the ACF and PACF graphs, it can be said that there is a trend in the data and correlation at lag 1 and lag 7.
The "price", "category_sold", "basket_count", and "category_favored" attributes are more reliable and significantly correlated with the data. Even though visit_count and favored_count are highly correlated with the data, they are also correlated with basket_count; therefore, they are not used as regressors.
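The multicollinearity argument above can be illustrated with a correlation matrix; the data below are simulated so that visit_count tracks sold_count mostly through basket_count:

```r
# When two candidate regressors are strongly correlated with each other,
# keeping both adds little; simulated data, illustrative only.
set.seed(2)
sold_count   <- rpois(100, 20)
basket_count <- sold_count * 3 + rpois(100, 5)
visit_count  <- basket_count * 10 + rpois(100, 50)

round(cor(cbind(sold_count, basket_count, visit_count)), 2)
```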
## price event_date product_content_id sold_count
## Min. :59.99 Min. :2020-05-25 Length:405 Min. : 0.00
## 1st Qu.:59.99 1st Qu.:2020-09-03 Class :character 1st Qu.: 0.00
## Median :59.99 Median :2020-12-13 Mode :character Median : 0.00
## Mean :60.11 Mean :2020-12-13 Mean : 18.35
## 3rd Qu.:59.99 3rd Qu.:2021-03-24 3rd Qu.: 3.00
## Max. :63.55 Max. :2021-07-03 Max. :286.00
## NA's :281
## visit_count favored_count basket_count category_sold
## Min. : 0 Min. : 0.0 Min. : 0.00 Min. : 20
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 132
## Median : 0 Median : 0.0 Median : 0.00 Median : 563
## Mean : 2457 Mean : 240.8 Mean : 88.64 Mean :1301
## 3rd Qu.: 589 3rd Qu.: 112.0 3rd Qu.: 19.00 3rd Qu.:1676
## Max. :45833 Max. :5011.0 Max. :1735.00 Max. :8099
##
## category_brand_sold category_visits ty_visits category_basket
## Min. : 0 Min. : 107 Min. : 1 Min. : 0
## 1st Qu.: 0 1st Qu.: 397 1st Qu.: 1 1st Qu.: 0
## Median : 2965 Median : 1362 Median : 1 Median : 0
## Mean : 14028 Mean : 82604 Mean : 44737307 Mean : 118415
## 3rd Qu.: 15079 3rd Qu.: 2871 3rd Qu.:102143446 3rd Qu.: 101167
## Max. :152168 Max. :1335060 Max. :178545693 Max. :1230833
##
## category_favored w_day mon is_campaign
## Min. : 628 Min. :1.000 Min. : 1.000 Min. :0.00000
## 1st Qu.: 2589 1st Qu.:2.000 1st Qu.: 4.000 1st Qu.:0.00000
## Median : 7843 Median :4.000 Median : 6.000 Median :0.00000
## Mean : 15287 Mean :4.007 Mean : 6.464 Mean :0.08642
## 3rd Qu.: 16401 3rd Qu.:6.000 3rd Qu.: 9.000 3rd Qu.:0.00000
## Max. :135551 Max. :7.000 Max. :12.000 Max. :1.00000
##
The trend, lag1, lag2, lag3, and lag7 variables are added to the data.
The data does not have constant variance; therefore, besides the simple linear model, sqrt and BoxCox transformations are used for the regression model.
For Product 9 the attributes are reliable, so all attributes are tried in the model and the most significant ones are selected.
Simple Linear Regression with no transformation
##
## Call:
## lm(formula = sold_count ~ price + visit_count + basket_count +
## favored_count + category_sold + category_visits + category_basket +
## category_favored + category_brand_sold + factor(w_day) +
## factor(mon) + trend + lag1 + lag3, data = train9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.248 -1.112 -0.030 1.443 31.628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.410e+02 1.085e+02 -2.221 0.026981 *
## price 4.057e+00 1.816e+00 2.234 0.026091 *
## visit_count -1.082e-03 6.195e-04 -1.747 0.081498 .
## basket_count 2.038e-01 7.661e-03 26.600 < 2e-16 ***
## favored_count -4.815e-03 4.388e-03 -1.097 0.273267
## category_sold 5.148e-03 8.736e-04 5.893 8.74e-09 ***
## category_visits 3.453e-06 8.701e-06 0.397 0.691717
## category_basket 3.348e-05 1.498e-05 2.234 0.026105 *
## category_favored -3.223e-04 8.046e-05 -4.005 7.53e-05 ***
## category_brand_sold -1.555e-04 1.243e-04 -1.251 0.211659
## factor(w_day)2 -1.574e+00 1.128e+00 -1.396 0.163695
## factor(w_day)3 7.633e-01 1.149e+00 0.664 0.506889
## factor(w_day)4 1.103e-01 1.151e+00 0.096 0.923689
## factor(w_day)5 3.665e-02 1.152e+00 0.032 0.974643
## factor(w_day)6 -2.088e-01 1.136e+00 -0.184 0.854199
## factor(w_day)7 5.030e-01 1.137e+00 0.442 0.658438
## factor(mon)2 -7.007e+00 1.823e+00 -3.842 0.000144 ***
## factor(mon)3 -7.194e+00 1.758e+00 -4.092 5.29e-05 ***
## factor(mon)4 -7.014e+00 1.985e+00 -3.534 0.000463 ***
## factor(mon)5 -9.773e+00 3.846e+00 -2.541 0.011477 *
## factor(mon)6 -6.645e+00 3.545e+00 -1.874 0.061692 .
## factor(mon)7 -3.371e+00 3.161e+00 -1.066 0.286926
## factor(mon)8 -5.443e-01 2.762e+00 -0.197 0.843853
## factor(mon)9 -1.190e+00 2.519e+00 -0.472 0.636947
## factor(mon)10 -2.114e+00 2.274e+00 -0.930 0.353123
## factor(mon)11 -1.897e+00 2.053e+00 -0.924 0.355932
## factor(mon)12 -1.069e+00 1.671e+00 -0.640 0.522666
## trend -6.034e-03 1.412e-02 -0.427 0.669394
## lag1 9.079e-02 2.380e-02 3.814 0.000161 ***
## lag3 7.836e-02 1.787e-02 4.385 1.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.85 on 360 degrees of freedom
## Multiple R-squared: 0.9861, Adjusted R-squared: 0.985
## F-statistic: 880.1 on 29 and 360 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 33
##
## data: Residuals
## LM test = 166.95, df = 33, p-value < 2.2e-16
The adjusted R-squared value is very high and the residuals appear roughly centered around zero. The model can be a good fit.
Simple Linear Regression Model with sqrt transformation
##
## Call:
## lm(formula = sqrt ~ price + visit_count + basket_count + favored_count +
## category_sold + category_visits + category_basket + category_favored +
## category_brand_sold + ty_visits + factor(w_day) + factor(mon) +
## lag1 + lag3, data = train9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3358 -0.2404 -0.0610 0.1759 4.8481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.256e+01 1.292e+01 -0.972 0.33175
## price 2.086e-01 2.149e-01 0.970 0.33246
## visit_count 4.802e-05 7.202e-05 0.667 0.50537
## basket_count 1.009e-02 9.288e-04 10.862 < 2e-16 ***
## favored_count -6.224e-04 4.932e-04 -1.262 0.20779
## category_sold 5.387e-04 1.068e-04 5.046 7.19e-07 ***
## category_visits 2.765e-06 8.455e-07 3.271 0.00118 **
## category_basket 3.112e-06 1.908e-06 1.631 0.10379
## category_favored -4.993e-05 9.264e-06 -5.390 1.28e-07 ***
## category_brand_sold -5.552e-06 1.569e-05 -0.354 0.72361
## ty_visits 1.457e-08 3.008e-09 4.843 1.90e-06 ***
## factor(w_day)2 1.471e-01 1.361e-01 1.081 0.28039
## factor(w_day)3 2.979e-01 1.380e-01 2.158 0.03156 *
## factor(w_day)4 3.219e-01 1.386e-01 2.323 0.02074 *
## factor(w_day)5 3.376e-01 1.387e-01 2.435 0.01539 *
## factor(w_day)6 3.561e-01 1.369e-01 2.602 0.00966 **
## factor(w_day)7 2.701e-01 1.364e-01 1.980 0.04851 *
## factor(mon)2 -9.424e-02 3.247e-01 -0.290 0.77180
## factor(mon)3 -1.099e+00 2.770e-01 -3.967 8.80e-05 ***
## factor(mon)4 -2.480e+00 2.941e-01 -8.433 8.28e-16 ***
## factor(mon)5 -9.449e-01 3.217e-01 -2.937 0.00352 **
## factor(mon)6 -2.390e-01 2.775e-01 -0.861 0.38974
## factor(mon)7 -3.521e-02 2.690e-01 -0.131 0.89593
## factor(mon)8 1.451e-01 2.327e-01 0.624 0.53322
## factor(mon)9 -5.139e-02 2.249e-01 -0.228 0.81939
## factor(mon)10 -2.160e-01 2.243e-01 -0.963 0.33615
## factor(mon)11 -2.139e-01 2.254e-01 -0.949 0.34312
## factor(mon)12 -1.965e-01 1.960e-01 -1.003 0.31662
## lag1 7.882e-03 2.876e-03 2.741 0.00644 **
## lag3 3.603e-03 2.123e-03 1.697 0.09052 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7076 on 360 degrees of freedom
## Multiple R-squared: 0.9688, Adjusted R-squared: 0.9662
## F-statistic: 384.9 on 29 and 360 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 33
##
## data: Residuals
## LM test = 177.51, df = 33, p-value < 2.2e-16
The sqrt transformation also gives a good fit by the R-squared value and residual analysis; however, it has a lower R-squared value than the model with no transformation.
BoxCox Transformation
##
## Call:
## lm(formula = BoxCox ~ price + visit_count + basket_count + favored_count +
## category_visits + category_basket + ty_visits + factor(w_day) +
## factor(mon) + lag1 + lag3, data = train9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4824 -0.4164 -0.0942 0.4179 7.3159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.153e+00 2.520e+01 -0.165 0.86917
## price 5.291e-04 4.193e-01 0.001 0.99899
## visit_count 2.403e-04 1.323e-04 1.816 0.07022 .
## basket_count 5.456e-03 1.676e-03 3.255 0.00124 **
## favored_count -1.583e-03 7.530e-04 -2.102 0.03626 *
## category_visits 2.171e-06 9.660e-07 2.247 0.02521 *
## category_basket 3.101e-06 1.087e-06 2.853 0.00459 **
## ty_visits 3.526e-08 5.591e-09 6.307 8.28e-10 ***
## factor(w_day)2 3.732e-01 2.661e-01 1.402 0.16164
## factor(w_day)3 5.536e-01 2.668e-01 2.075 0.03865 *
## factor(w_day)4 7.663e-01 2.673e-01 2.867 0.00438 **
## factor(w_day)5 8.076e-01 2.664e-01 3.032 0.00261 **
## factor(w_day)6 7.532e-01 2.663e-01 2.828 0.00494 **
## factor(w_day)7 5.979e-01 2.669e-01 2.240 0.02569 *
## factor(mon)2 1.031e+00 6.301e-01 1.637 0.10257
## factor(mon)3 -8.714e-01 5.402e-01 -1.613 0.10759
## factor(mon)4 -5.120e+00 5.691e-01 -8.997 < 2e-16 ***
## factor(mon)5 -1.495e+00 5.203e-01 -2.874 0.00430 **
## factor(mon)6 -1.056e-01 3.524e-01 -0.300 0.76452
## factor(mon)7 -6.734e-01 3.545e-01 -1.900 0.05825 .
## factor(mon)8 -6.263e-01 3.544e-01 -1.767 0.07803 .
## factor(mon)9 -6.531e-01 3.574e-01 -1.828 0.06844 .
## factor(mon)10 -6.609e-01 3.543e-01 -1.866 0.06291 .
## factor(mon)11 -6.200e-01 3.572e-01 -1.736 0.08344 .
## factor(mon)12 -6.594e-01 3.545e-01 -1.860 0.06364 .
## lag1 9.810e-03 5.592e-03 1.754 0.08020 .
## lag3 2.617e-03 4.146e-03 0.631 0.52842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.386 on 363 degrees of freedom
## Multiple R-squared: 0.9192, Adjusted R-squared: 0.9134
## F-statistic: 158.9 on 26 and 363 DF, p-value: < 2.2e-16
##
## Breusch-Godfrey test for serial correlation of order up to 30
##
## data: Residuals
## LM test = 169.2, df = 30, p-value < 2.2e-16
The BoxCox transformation can also be a good fit, since its adjusted R-squared value is high.
In all lm models, the residuals are significantly correlated at lag 1, which is not desirable.
Arima Models
When the ARIMA models are constructed, the auto.arima function is used, and it is rerun every day. Seasonality is set to TRUE, and the frequency is determined as seven by observing the ACF and PACF graphs.
Additive, multiplicative, and linear-regression-based decompositions are used to obtain stationary data.
## [1] "The Additive Model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0082
## [1] "The Multiplicative Model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.083
## [1] "Linear Regression"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0267
I used the additive model in the examination; however, the multiplicative model is also used in the predictions and its error rate is calculated, since it is significant at the 0.05 level.
Arima
## Series: decomposed$random
## ARIMA(0,0,2)(0,0,2)[7] with zero mean
##
## Coefficients:
## ma1 ma2 sma1 sma2
## 0.0166 -0.210 0.1241 0.1177
## s.e. 0.0661 0.076 0.0562 0.0586
##
## sigma^2 estimated as 101.9: log likelihood=-1430.87
## AIC=2871.74 AICc=2871.9 BIC=2891.49
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,2)(0,0,2)[7] with zero mean
## Q* = 61.582, df = 10, p-value = 1.816e-09
##
## Model df: 4. Total lags used: 14
Arima with Regressor
## Series: decomposed$random
## Regression with ARIMA(0,0,2)(1,0,2)[7] errors
##
## Coefficients:
## ma1 ma2 sar1 sma1 sma2 intercept xreg
## -0.0805 -0.3896 -0.8042 0.8776 0.2227 355.0113 -5.9076
## s.e. 0.0845 0.1029 0.0895 0.0983 0.0595 144.1929 2.3990
##
## sigma^2 estimated as 99.44: log likelihood=-1425.37
## AIC=2866.75 AICc=2867.13 BIC=2898.35
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(0,0,2)(1,0,2)[7] errors
## Q* = 74.376, df = 7, p-value = 1.92e-13
##
## Model df: 7. Total lags used: 14
Arima combined with Linear Regression
## Series: residuals
## ARIMA(1,0,0) with zero mean
##
## Coefficients:
## ar1
## 0.1640
## s.e. 0.0499
##
## sigma^2 estimated as 30.82: log likelihood=-1221.38
## AIC=2446.77 AICc=2446.8 BIC=2454.7
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0) with zero mean
## Q* = 17.102, df = 9, p-value = 0.04714
##
## Model df: 1. Total lags used: 10
Among the ARIMA models, the one combined with linear regression has the weakest residual autocorrelation and the lowest AIC value; therefore, it can be the best fit model.
## event_date actual sqrt_forecasted_sold BoxCox_forecasted_sold
## 1: 2021-06-19 26 40.71978 23.843419
## 2: 2021-06-20 15 37.83503 29.940431
## 3: 2021-06-21 20 18.10912 12.391943
## 4: 2021-06-22 47 18.99811 10.866454
## 5: 2021-06-23 40 23.85368 16.080255
## 6: 2021-06-24 37 22.29100 15.408298
## 7: 2021-06-25 20 21.12638 11.809304
## 8: 2021-06-26 27 15.25724 7.735327
## 9: 2021-06-27 20 29.69443 23.901945
## 10: 2021-06-28 26 16.28321 13.350304
## 11: 2021-06-29 19 29.98746 31.296925
## 12: 2021-06-30 20 29.00951 29.632218
## 13: 2021-07-01 14 20.42428 15.870863
## 14: 2021-07-02 8 14.63988 10.540106
## lm_forecasted_sold forecasted_lm9_arima add_arima_forecasted
## 1: 53.60680 51.82296 53.17379
## 2: 25.63597 21.51158 55.07792
## 3: 28.82212 31.34634 50.73318
## 4: 48.30746 42.48172 39.35532
## 5: 41.30164 42.75290 40.53021
## 6: 38.18430 35.96907 37.48216
## 7: 32.00696 34.95751 33.04259
## 8: 26.10417 23.68023 28.24629
## 9: 18.14258 20.09614 31.74235
## 10: 18.17077 16.38016 32.17666
## 11: 29.19542 30.60067 28.66079
## 12: 32.02036 29.43861 25.90133
## 13: 27.17432 26.75186 24.36589
## 14: 16.04085 11.84275 20.95494
## mul_arima_forecasted reg_add_arima_forecasted reg_mul_arima_forecasted
## 1: 38.06669 53.13962 37.64733
## 2: 74.23554 55.08468 74.99982
## 3: 37.16847 50.77435 46.35113
## 4: 32.56984 39.42061 31.36196
## 5: 53.92517 40.56712 54.00017
## 6: 38.03248 37.93574 39.40081
## 7: 27.91288 33.47467 29.19652
## 8: 23.01908 28.63710 23.45338
## 9: 40.43471 32.14315 41.18473
## 10: 22.61740 32.60195 23.04039
## 11: 23.99802 29.05578 24.46084
## 12: 35.12226 26.29765 35.80254
## 13: 24.92684 24.74094 25.39693
## 14: 18.02096 21.34210 18.35011
## model n mean sd CV FBias
## 1: sqrt_forecasted_sold 14 24.21429 10.72867 0.443072 0.002274018
## 2: BoxCox_forecasted_sold 14 24.21429 10.72867 0.443072 0.254667279
## 3: lm_forecasted_sold 14 24.21429 10.72867 0.443072 -0.282341375
## 4: forecasted_lm9_arima 14 24.21429 10.72867 0.443072 -0.237854005
## 5: add_arima_forecasted 14 24.21429 10.72867 0.443072 -0.479184122
## 6: mul_arima_forecasted 14 24.21429 10.72867 0.443072 -0.445576199
## 7: reg_add_arima_forecasted 14 24.21429 10.72867 0.443072 -0.490311090
## 8: reg_mul_arima_forecasted 14 24.21429 10.72867 0.443072 -0.488633222
## MAPE RMSE MAD MADP WMAPE
## 1: 0.5176652 13.67392 11.688884 0.4827268 0.4827268
## 2: 0.4853113 15.80493 12.621227 0.5212306 0.5212306
## 3: 0.4582578 10.87023 8.348478 0.3447749 0.3447749
## 4: 0.4219104 10.67494 8.400723 0.3469325 0.3469325
## 5: 0.7234923 17.11155 12.695198 0.5242855 0.5242855
## 6: 0.7644155 19.51808 13.902693 0.5741525 0.5741525
## 7: 0.7378675 17.23471 12.955303 0.5350273 0.5350273
## 8: 0.8187681 20.61137 14.995371 0.6192779 0.6192779
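The accuracy measures in the table above follow the usual definitions, with WMAPE computed as total absolute error over total actual sales. A sketch of the computation (illustrative Python; the function name is my own):

```python
import math

def accuracy_metrics(actual, forecast):
    """Point-forecast accuracy measures as used in the comparison table."""
    n = len(actual)
    errors = [a - f for a, f in zip(actual, forecast)]
    mean_a = sum(actual) / n
    sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / (n - 1))
    return {
        "mean": mean_a,
        "sd": sd_a,
        "CV": sd_a / mean_a,                                 # coefficient of variation
        "FBias": sum(errors) / sum(actual),                  # forecast bias
        "MAPE": sum(abs(e) / a for e, a in zip(errors, actual)) / n,
        "RMSE": math.sqrt(sum(e ** 2 for e in errors) / n),
        "MAD": sum(abs(e) for e in errors) / n,
        "WMAPE": sum(abs(e) for e in errors) / sum(actual),  # weighted MAPE
    }
```

Because WMAPE weights each day's error by actual sales volume, it is less distorted by low-sales days than MAPE, which is why it is used as the selection criterion here.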
The linear regression model with no transformation has the lowest WMAPE value and is therefore selected as the best-fitting model. However, the selection is updated daily: each day, the error rates of every model's predictions over the last 14 days are calculated, and the model with the lowest WMAPE is selected for the next forecast.
## add_arima mul_arima xreg_mul_arima xreg_add_arima
## 18.701038 22.265567 22.686007 19.097003
## forecast_lm forecast_lm_arima.1 BoxCox_lm Sqrt_lm
## 14.047509 10.743843 8.938846 12.659273
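The daily selection rule can be sketched as follows (illustrative Python, with hypothetical names, assuming per-model forecast histories aligned with the actuals): each day, compute the WMAPE of every model over the trailing 14 days and use the winner for the next prediction.

```python
def wmape(actual, forecast):
    """Weighted MAPE: total absolute error divided by total actual sales."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(actual)

def pick_model(actual_history, forecast_histories, window=14):
    """Name of the model with the lowest WMAPE over the last `window` days."""
    recent_actual = actual_history[-window:]
    return min(
        forecast_histories,
        key=lambda name: wmape(recent_actual, forecast_histories[name][-window:]),
    )
```

For example, with `actual = [26, 15, 20, 47]` and two candidate histories, the model whose recent forecasts track the actuals more closely is returned.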
In order to predict one-day-ahead sales of the different products, various ARIMA and linear regression models were tried, and based on their performance on the test set, which consists of the dates from 29 May 2021 to 11 June 2021, a different model was selected for each product. Campaign dates of Trendyol were included as external data; however, since not every Trendyol campaign is listed on the website, some of the outliers may not be fully explained by the models, and further investigation of campaign dates could improve them. Sales are also affected by the overall state of the economy, so additional external data, such as the dollar exchange rate, could be included for improved accuracy.
Treating each product individually is one of the strengths of this approach, even though it is time-consuming. Trying various models and measuring their performance on the test data is another strength of the models proposed for each product.
Overall, the models work reasonably well; the deviation from the real values is not too large.
Lecture Notes
The code for this study is available here.